Difference between revisions of "Main Page"

Revision as of 14:31, 24 March 2006

EDAC Wiki

This is a wiki for the Linux EDAC project

What is it?

EDAC stands for "Error Detection and Correction". The Linux EDAC project comprises a series of Linux kernel modules which make use of the error detection facilities of computer hardware. Hardware which detects the following errors is currently supported:

  • System RAM errors (this is the original, and most mature, part of the project). Many computers support RAM EDAC (especially those with chipsets aimed at high-reliability applications), but RAM which has extra storage capacity ("ECC RAM") is needed for these facilities to operate.
  • PCI bus transfer errors. The majority of PCI bridges and peripherals support such error detection.

Why do I need it?

Without the EDAC modules, on most current Linux systems:

  • You may be experiencing PCI data corruption (e.g. your data is being corrupted whilst travelling over the PCI bus to or from your NIC or storage adapter) and not know about it, as most systems do not check PCI devices for reported PCI parity errors. Some systems may trigger an NMI, but you get no further information about what caused it.
  • If you have ECC memory and you are experiencing correctable ECC errors, you probably won't know anything about it. The memory may later fail completely, and you won't know anything until your system receives an NMI and/or crashes. With EDAC, you get to know about bad memory modules before the errors become uncorrectable and your data is potentially corrupted. This includes modules which are bad as shipped, which can be caught before such systems are put into service.
  • If you have a motherboard which claims to support ECC, but the BIOS is not correctly enabling ECC mode, you won't know anything about it until your machine crashes with unexplained memory errors. You won't even get an NMI, and the extra money spent on ECC memory will be wasted.

Help!

About the Errors that EDAC generates

If the EDAC subsystem is reporting errors on your system, please see WhyAmIgettingMemoryErrors and WhyAmIgettingPciErrors. Please check the possibilities listed on those pages, and elsewhere on this wiki, before you either open a new bug report or post to the mailing list.

The EDAC Bug Database

If you think you've found a bug, please search the EDAC Bugzilla to see if it has already been reported. If it has, you can add yourself to the cc list for that bug, so that you are automatically informed of updates. If it hasn't, please create a new bug report.

The EDAC Mailing List

Most of the EDAC developers keep an eye on the EDAC mailing list (hosted by Sourceforge) to a greater or lesser extent, but please remember that not many of them work on EDAC as part of their job (and those who do are paid to keep their employer's systems running). Check the Wiki, the bug database, and the mailing list archives for your problem first. If you have exhausted these possibilities, then by all means post to the mailing list!

  • Be polite
  • Please make sure you give all information which might be relevant e.g. your (exact) kernel version
  • Be patient

If you get a reply, or find things out which weren't known about before, please add the information to this Wiki, in order to help others.

Status

The EDAC code is in Linux kernel version 2.6.16. The userspace API (via sysfs) is still a work in progress, and is not expected to firm up until 2.6.17. Please contribute to this effort, and help develop the necessary userspace tools (see below).

History

The EDAC project was renamed from "bluesmoke" prior to submission to the mainline Linux kernel. The Bluesmoke code was created by Thayne Harbaugh. The Linux-ECC project was EDAC's predecessor and its major inspiration. Developed by Dan Hollis and others, the Linux-ECC project is no longer maintained.

Supported Hardware

System Main Memory EDAC

Supported Memory Controllers

Please see the individual driver pages for information on supported revisions, motherboard-specific information etc.

Manufacturer | Model                | EDAC Driver       | Tech Docs | Controller Capabilities           | Status
AMD          | Opteron              | k8_edac.c         | AMD       | EDAC, ErrorScrub, BackgroundScrub | Supported Development Tree
AMD          | Athlon64             | k8_edac.c         | AMD       | EDAC, ErrorScrub, BackgroundScrub | Supported Development Tree
AMD          | AthlonFX             | k8_edac.c         | AMD       | EDAC, ErrorScrub, BackgroundScrub | Supported Development Tree
AMD          | 760                  | amd76x_edac.c     | AMD       |                                   | Supported (Linux 2.6.16)
AMD          | 762                  | amd76x_edac.c     | AMD       |                                   | Supported (Linux 2.6.16)
AMD          | 768                  | amd76x_edac.c     | AMD       |                                   | Supported (Linux 2.6.16)
Intel        | e7500                | e7xxx_edac.c      |           |                                   | Supported (Linux 2.6.16)
Intel        | e7501                | e7xxx_edac.c      |           |                                   | Supported (Linux 2.6.16)
Intel        | e7505                | e7xxx_edac.c      |           |                                   | Supported (Linux 2.6.16)
Intel        | e7520                | e752x_edac.c      |           |                                   | Supported (Linux 2.6.16)
Intel        | e7525                | e752x_edac.c      |           |                                   | Supported (Linux 2.6.16)
Intel        | 82875p               | i82875p_edac.c    |           | EDAC                              | Supported (Linux 2.6.16)
Intel        | e7210                |                   |           |                                   | Supported (Linux 2.6.16)
Intel        | 82860                | i82860_edac.c     |           |                                   | Supported (Linux 2.6.16)
Intel        | 82443BX/GX(440BX/GX) | i82443bxgx_edac.c | Intel     | EDAC, ErrorScrub                  | Alpha driver (see mailing list)
Radisys      | 82600                | r82600_edac.c     | Radisys   | EDAC, ErrorScrub                  | Supported (Linux 2.6.16)

Customisation for your Hardware

For many chipsets and motherboards, there is no consistent relationship between the memory banks/slots as made available to the EDAC driver, and the physical labels present next to the memory module socket. You can help by working out the relationship for your hardware, and adding the info to the MemorySlotLabels page.
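
As a rough illustration of what such a relationship looks like once worked out, the sketch below records it as a lookup table. It assumes the driver identifies a memory location by a chip-select row and a channel number; the row/channel values and the socket labels are invented for an imaginary motherboard and are not taken from the MemorySlotLabels page.

```python
# Hypothetical mapping for one imaginary motherboard:
# (csrow, channel) as seen by the driver -> label printed next to the socket.
SLOT_LABELS = {
    (0, 0): "DIMM_A1",
    (0, 1): "DIMM_B1",
    (1, 0): "DIMM_A1",  # a second row may land on the same physical socket
    (1, 1): "DIMM_B1",
    (2, 0): "DIMM_A2",
    (2, 1): "DIMM_B2",
}


def slot_label(csrow, channel, labels=SLOT_LABELS):
    """Return the physical socket label, or a placeholder if it is not mapped yet."""
    return labels.get((csrow, channel), f"unknown (csrow {csrow}, channel {channel})")
```

With a table like this, an error reported against row 1, channel 0 can be translated into "replace the module in DIMM_A1" rather than a pair of numbers.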

PCI Error Reporting

PCI parity error reporting facilities are included in the PCI specification, and the majority of add-in cards (and chips which can be used in either add-in or on-motherboard designs) support PCI parity error detection and reporting. Some "fake" PCI devices which are not physically connected by a PCI bus (e.g. some ATA host adaptors which are built in to a motherboard chipset) typically do not include this functionality.

Error Detection Overhead

The driver currently only supports error detection via polling. Polling all of the PCI devices' error status registers can be time consuming, especially on machines which have many devices. You may wish to slow the error polling rate, or disable it altogether on such systems.
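
To make the cost of polling more concrete, here is a minimal user-space sketch of what checking one device involves. It is not the EDAC driver's code (which runs in the kernel); it only decodes the 16-bit Status register at offset 0x06 of the PCI configuration header, using the bit positions defined by the PCI specification. The function names `parity_flags` and `poll_once` are made up for this example, and reading configuration space through `/sys/bus/pci/devices/*/config` is an assumption about the running system.

```python
import struct
from pathlib import Path

PCI_STATUS_OFFSET = 0x06               # 16-bit Status register in the config header
PCI_STATUS_PARITY = 0x0100             # bit 8: Master Data Parity Error
PCI_STATUS_SIG_SYSTEM_ERROR = 0x4000   # bit 14: Signaled System Error
PCI_STATUS_DETECTED_PARITY = 0x8000    # bit 15: Detected Parity Error


def parity_flags(config):
    """Decode the error-related Status bits from raw configuration space bytes."""
    if len(config) < PCI_STATUS_OFFSET + 2:
        raise ValueError("need at least 8 bytes of configuration space")
    (status,) = struct.unpack_from("<H", config, PCI_STATUS_OFFSET)
    return {
        "master_data_parity_error": bool(status & PCI_STATUS_PARITY),
        "signaled_system_error": bool(status & PCI_STATUS_SIG_SYSTEM_ERROR),
        "detected_parity_error": bool(status & PCI_STATUS_DETECTED_PARITY),
    }


def poll_once(root="/sys/bus/pci/devices"):
    """Return {device: flags} for every device currently flagging a parity error."""
    hits = {}
    for cfg in sorted(Path(root).glob("*/config")):
        try:
            flags = parity_flags(cfg.read_bytes()[:64])
        except (OSError, ValueError):
            continue  # unreadable or truncated config space; skip the device
        if flags["detected_parity_error"] or flags["master_data_parity_error"]:
            hits[cfg.parent.name] = flags
    return hits
```

Every pass touches every device, which is why the cost grows with the number of devices. The sketch only reads the bits; they are write-one-to-clear, so a real poller would also have to reset them after reporting.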

Faulty Hardware

Some PCI devices (or just particular revisions of those devices) are broken with respect to PCI parity detection, and display false positives. You can check (and add to) the list of broken devices on the PCIDevicesWithBrokenParityDetection page.

Help Wanted

Please feel free to:

  • Improve this documentation
  • HowToWriteNewMemoryControllerDrivers
  • Test the code
  • Report broken hardware for the blacklists
  • Create memory slot entries for your hardware
  • Create some user-space code (e.g. scripts to go in a cron job, extensions to SNMP daemons etc. etc.)
  • Create a script to generate DIMM labels and whitelists from the Wiki contents
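
As a possible starting point for the cron-job scripts suggested in the list above, the sketch below reads per-controller error counters from sysfs. It assumes a layout of `/sys/devices/system/edac/mc/mc<N>/ce_count` and `ue_count`; since the userspace API is still being settled, the paths may differ on your kernel. The function names `read_counts` and `report` are made up for this example.

```python
from pathlib import Path


def read_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each mc<N> directory found.

    The directory layout is an assumption; adjust `root` and the file
    names to match what your kernel actually exports.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            counts[mc.name] = {
                "ce": int((mc / "ce_count").read_text().strip()),
                "ue": int((mc / "ue_count").read_text().strip()),
            }
        except (OSError, ValueError):
            continue  # attribute missing or unreadable; skip this controller
    return counts


def report(counts):
    """Build one line per controller that has logged any errors."""
    lines = []
    for name, c in sorted(counts.items()):
        if c["ce"] or c["ue"]:
            lines.append(f"{name}: {c['ce']} correctable, {c['ue']} uncorrectable")
    return lines


if __name__ == "__main__":
    for line in report(read_counts()):
        print(line)
```

Run from cron, the script prints nothing when all counters are zero, so any output (typically mailed to the job's owner) indicates that errors have been logged.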

Related Articles

Sourceforge web page - [1]

An overview of EDAC technologies on Wikipedia [2]

The original Linux ECC project (Dan Hollis et al) - [3]

How to use this site

A Wiki is a collaborative site; anyone can contribute and share:

  • Edit any page by pressing Edit at the top or the bottom of the page
  • Create a link to another page with joined capitalized words (like WikiSandBox) or with [[quoted words in brackets]]
  • Search for page titles or text within pages using the search box at the top of any page
  • See HelpForBeginners to get you going, HelpContents for all help pages.

To learn more about what a WikiWikiWeb is, read about MoinMoin:WhyWikiWorks and the MoinMoin:WikiNature. Also, consult the MoinMoin:WikiWikiWebFaq.

This wiki is powered by MediaWiki.