What is ECC SDRAM? ECC (error-correcting code) SDRAM is memory
that can detect and correct some SDRAM errors without user
intervention. ECC SDRAM replaced parity memory, which could only detect,
but not correct, SDRAM errors.
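To make the "detect, but not correct" limitation of parity memory concrete, here is a minimal sketch in Python. It assumes a single even-parity bit stored per 8-bit byte; the function names are made up for illustration and do not come from the Dell or IBM articles.

```python
def parity_bit(word: int) -> int:
    """Return the even-parity bit for a word: 1 if it has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a word read from memory against the parity bit stored with it."""
    return parity_bit(word) == stored_parity


data = 0b10110010
stored = parity_bit(data)        # written to memory alongside the data

single_flip = data ^ (1 << 3)    # one bit flipped by a soft error
double_flip = data ^ 0b11        # two bits flipped

print(parity_ok(data, stored))         # intact word passes the check
print(parity_ok(single_flip, stored))  # single-bit error is detected
print(parity_ok(double_flip, stored))  # double-bit error slips through
```

The parity check only says that something is wrong, not which bit is wrong, so the memory controller cannot repair the word. An even number of flipped bits is not noticed at all.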
What causes SDRAM errors? Per Dell,
"Memory errors are characterized as hard or soft. Hard errors
are caused by defects in the silicon or metalization of the SDRAM
package, and are usually permanent once they manifest. Soft errors
are caused by charged particles or radiation, and are transient. In
the past, soft errors were primarily caused by alpha particles, but
that failure mode has been mostly eliminated today by strict quality
control of the packaging material by SDRAM vendors. Currently the
primary source of soft errors in SDRAM is electrical disturbance caused
by cosmic rays, which are very high-energy subatomic particles originating
in outer space."
[Figure: A typical PC133, 9-chip, 16 x 64, 128 MB ECC SDRAM.]
What happens when an SDRAM crash occurs?
When main memory crashes, all data in memory is lost. The larger the
amount of main memory on the computer, the greater the possibility
of nonrecoverable data loss.
What kind of errors can ECC SDRAM correct?
Most ECC SDRAM can correct single-bit errors and can detect, but not
correct, larger errors (typically double-bit errors). Thus, errors
larger than 1 bit will still crash the computer.
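The "correct one bit, detect two" behavior can be sketched with a textbook Hamming code plus one overall parity bit (often called SECDED). The example below protects only 4 data bits to stay short; real ECC modules protect much wider words, and the exact code used by any particular chipset is not given in the referenced articles. The function names and bit layout here are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into an 8-bit SECDED codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2          # makes the total number of 1 bits even
    return c


def decode(codeword: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected' or 'uncorrectable'."""
    c = codeword.copy()
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos        # XOR of the positions holding a 1 bit
    overall_odd = sum(c) % 2 == 1  # True when an odd number of bits flipped

    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Single-bit error: the syndrome is the position of the bad bit
        # (0 means the overall parity bit itself was hit).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    data = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, data


word = encode(0b1011)

one_error = word.copy()
one_error[6] ^= 1                  # flip a single bit

two_errors = word.copy()
two_errors[2] ^= 1                 # flip two different bits
two_errors[5] ^= 1

print(decode(word))
print(decode(one_error))
print(decode(two_errors))
```

In this sketch the single-bit error is repaired silently, while the double-bit error is only reported. On a real machine that report is what typically halts the system.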
Chipkill was invented to augment ECC SDRAM. Large server manufacturers
have added error-correcting hardware capabilities beyond standard ECC
with a technology known as Chipkill.
Per Dell, "Chipkill correct is the ability of the memory system
to withstand a multibit failure within a SDRAM device, including a
failure that causes incorrect data on all data bits of the device.
These methods rely on the chip set and hardware architecture of the
system and cannot be achieved through software upgrades."
So what is the possibility of data loss?
The data shown below illustrates the results of an IBM analysis comparing
server outages due to memory failures of parity, ECC and Chipkill-equipped
servers.
In summary, the following outage rates were identified:
The 32MB parity memory-equipped server experienced 7 outages
per 100 servers over 3 years.
The 1GB ECC memory-equipped server experienced 9 outages
per 100 servers over 3 years.
The 4GB Chipkill-equipped server experienced 6 outages
per 10,000 servers over 3 years.
The Chipkill-equipped server had an outage rate roughly 150 times lower
than the server with regular ECC SDRAM (6 per 10,000 versus 9 per 100),
even though it carried four times as much memory. Also, remember that
the more system memory a computer has, the more likely it will crash
due to a memory error.
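The comparison can be checked with a few lines of Python. The outage counts and memory sizes are the ones quoted above from the IBM analysis. The per-gigabyte figures are a rough normalization added here for illustration; they assume outage risk scales linearly with installed memory and that 32MB is 32/1024 of a gigabyte, neither of which is stated in the source.

```python
from fractions import Fraction

# Outages per server over 3 years and installed memory in GB, as quoted above.
servers = {
    "parity":   {"outages": Fraction(7, 100),    "memory_gb": Fraction(32, 1024)},
    "ecc":      {"outages": Fraction(9, 100),    "memory_gb": Fraction(1)},
    "chipkill": {"outages": Fraction(6, 10_000), "memory_gb": Fraction(4)},
}

# How many times more often the ECC server went down than the Chipkill server.
ecc_vs_chipkill = servers["ecc"]["outages"] / servers["chipkill"]["outages"]

# Rough normalization: outages per server per GB of installed memory.
per_gb = {name: s["outages"] / s["memory_gb"] for name, s in servers.items()}

for name, rate in per_gb.items():
    print(f"{name:9s} {float(rate):.5f} outages per server per GB over 3 years")
print(f"ECC vs Chipkill outage ratio: {ecc_vs_chipkill}")
```

Using exact fractions avoids floating-point rounding in the ratio. Under the linear-scaling assumption, the gap between the three memory types is even wider per gigabyte than the raw per-server numbers suggest.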
What about speed? I could find no conclusive
evidence that ECC SDRAM performs any slower than non-ECC SDRAM. Both
Dell and IBM state in their referenced articles that there is no speed
penalty for using a Chipkill-enhanced server instead of an ECC
memory-equipped server without Chipkill.
So who should buy ECC SDRAM? First, the
average user should be frequently saving data to their hard drive,
so the amount of data lost in a memory failure should be small, and
ECC memory would therefore be overkill.
Second, if you are thinking of running a server, you definitely want
to have a working RAID disk array, as your hard drives are much more
likely to fail than your memory.
Third, if you want to run a server, there is no reason not to have
ECC memory if your motherboard supports it. Currently, ECC SDRAM costs
only a little more than regular SDRAM.
Referenced Articles
IBM Chipkill Memory - IBM, February 1999
Chipkill Correct Memory Architecture - Dell, August 2000