HowMemoryEdacHardwareWorks

From EdacWiki

disclaimer

this was quickly written up by REW to complete this wiki.

How Memory EDAC works

Suppose you have to send 10 single-digit numbers to a friend. You write them down, your friend reads them back from that piece of paper and does whatever he has to do with them. This is error prone: any sloppy writing will cause the numbers on the other end to be wrong.

So now we write down not only the 10 single-digit numbers, but also the SUM of those 10 numbers. This is only a small overhead, but now some types of errors can be detected: reading a 1 instead of a 7 will make the sum come out wrong. This is called "error detection", the ED in "EDAC".
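The checksum idea above can be sketched in a few lines of Python (the digits and function names here are just for illustration): send the digits plus their sum, and the receiver recomputes the sum to spot a misread digit.

```python
def make_message(digits):
    """Append a checksum (the plain sum) to a list of single digits."""
    return digits + [sum(digits)]

def check_message(message):
    """Return True if the digits still match the transmitted checksum."""
    *digits, checksum = message
    return sum(digits) == checksum

sent = make_message([7, 1, 4, 1, 5, 9, 2, 6, 5, 3])
assert check_message(sent)              # arrives intact: sum matches

garbled = list(sent)
garbled[0] = 1                          # the 7 is misread as a 1...
assert not check_message(garbled)       # ...and the sum gives it away
```

Note that the checksum only *detects* the error; many different mistakes produce the same wrong sum, so it cannot by itself say which digit to fix.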

Now when the sum isn't correct, the person reading the slip of paper can go back and look for possible errors like the one suggested above. If the sum then comes out correct, he can deduce the original numbers. This is the "and correction", the AC part in "EDAC".

computers and EDAC

All numbers in computers are represented in binary, just ones and zeroes. Instead of storing only the 64 bits of a memory word, ECC hardware stores 72: alongside the actual value it calculates 8 extra bits. By doing some interesting math they have figured out a way to generate those 8 bits such that if any ONE of those 72 bits flips, we can deduce which one it was!
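The "interesting math" is a Hamming code. A small sketch with 8 data bits and 4 check bits (the real hardware uses 64 + 8, but the principle is identical): place the check bits at positions 1, 2, 4 and 8, and choose them so that XOR-ing together the positions of all 1-bits gives zero. Flip any single bit and that XOR (the "syndrome") becomes exactly the position of the flipped bit.

```python
def syndrome(code):
    """XOR of the positions of all 1-bits; zero for a valid code word."""
    s = 0
    for p in range(1, 13):
        if code[p]:
            s ^= p
    return s

def encode(data_bits):
    """Place 8 data bits, then set check bits 1,2,4,8 to zero the syndrome."""
    code = [0] * 13                     # index 0 unused; positions 1..12
    data_pos = [p for p in range(1, 13) if p not in (1, 2, 4, 8)]
    for p, b in zip(data_pos, data_bits):
        code[p] = b
    s = syndrome(code)
    for k in range(4):                  # check bit at 2**k cancels syndrome bit k
        code[1 << k] = (s >> k) & 1
    return code

word = encode([1, 0, 1, 1, 0, 0, 1, 0])
assert syndrome(word) == 0              # a stored word checks out

word[5] ^= 1                            # flip any single bit...
assert syndrome(word) == 5              # ...and the syndrome names its position
```

This works because each check bit sits at a power-of-two position, so it influences exactly one bit of the syndrome and the check bits never disturb each other.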

However, this cannot go on forever. After flipping a few bits, we arrive at a combination of bits that is again a valid code word, just for a different value. Most often a code is used that can correct any single-bit error and detect any double-bit error.

an example

Suppose we have 64 bits to protect. We could store just one extra bit, chosen so that there is always an EVEN number of ones. This is called parity. If any single bit flips, we'll notice: the total number of ones is now odd. Flipping any other bit gives us a valid code again, but for a different value. This is "single error detect" (and no "correct").
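The even-parity scheme can be sketched like this for a 64-bit word (the sample value is arbitrary): one extra bit makes the total number of 1-bits even, so any single flip is noticed, but a second flip restores even parity and slips through undetected.

```python
def parity_bit(word):
    """Extra bit that makes the total number of 1-bits in `word` even."""
    return bin(word).count("1") % 2

stored_word = 0xDEADBEEFCAFEF00D
stored_parity = parity_bit(stored_word)

# Later, on read-back: recompute the parity and compare.
assert parity_bit(stored_word) == stored_parity        # no error

flipped = stored_word ^ (1 << 17)                      # flip one bit
assert parity_bit(flipped) != stored_parity            # detected

double = flipped ^ (1 << 42)                           # flip a second bit
assert parity_bit(double) == stored_parity             # slips through!
```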

Now if I do this trick on the first 32 bits and the last 32 bits separately, then when an error occurs I can tell in which half it occurred. I still don't know where within that half, but at least I know more than just "an error occurred".

If I now split my 64 bits into two groups in a different way and add two more parity bits, I will also know in which of those two groups the error occurred. These groups are formed by splitting the previous two groups exactly down the middle: the first halves go into the first group of the next step, the second halves into the second.

This way we can go on splitting and regrouping 4 more times, calculating two extra bits each time, for a total of 12 extra bits. When an error occurs (inside our "data area"), exactly 6 of the parity calculations will come out wrong, and together they exactly pinpoint the bit that was flipped. (This algorithm is actually used on 4096 bits at a time in flash memories; in that case 24 extra bits are needed/used.)
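The whole split-and-group scheme can be sketched as follows for 64 data bits (the index 37 is just an example): round k groups the bit positions by bit k of their index, giving 6 rounds of two parity bits each, 12 extra bits in total. The pattern of which parity fails in each round spells out the flipped position in binary.

```python
import random

def parities(bits):
    """For each of 6 rounds, even-parity of the two groups formed by
    splitting the 64 positions on bit k of their index."""
    out = []
    for k in range(6):
        group0 = sum(b for i, b in enumerate(bits) if not (i >> k) & 1) % 2
        group1 = sum(b for i, b in enumerate(bits) if (i >> k) & 1) % 2
        out.append((group0, group1))
    return out

data = [random.randint(0, 1) for _ in range(64)]
stored = parities(data)

data[37] ^= 1                           # a single bit flips in storage
recomputed = parities(data)

# In each round exactly one of the two group parities disagrees; whether
# it is the "bit k set" group gives us bit k of the error position.
position = 0
for k, (old, new) in enumerate(zip(stored, recomputed)):
    assert old != new                   # one group parity failed this round
    if old[1] != new[1]:                # the error sits in the bit-k-set group
        position |= 1 << k
assert position == 37
```

Note this naive layout spends 12 bits where the Hamming construction spends 7; it trades efficiency for an easy-to-follow structure.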

As said before, actual hardware does it with only 8 bits. Theory predicts that it could be done with only 7. However, 7 is an awkward number for computers, so they use 8; the eighth bit is also what lets the code detect (though not correct) double-bit errors.