A computer components & hardware forum. HardwareBanter


When is ECC not ECC ?



 
 
#1
May 19th 04, 04:59 AM
Gianni Mariani


I started getting instabilities on my homebuilt dual Athlon 2400 on an
MSI K7D Master. I run Linux on the box, and it began throwing kernel panics.

The system has two 512MB sticks of Samsung ECC memory and the BIOS setting
is set to "Error Correct", so I thought this would be adequate to
deal with memory issues.

The failures first showed up a while ago in a program I wrote
called "cpulat", which measures CPU-to-CPU memory latency. It would
crash in unexpected, nonsensical places, and after trying to debug it for
a while I just gave up. Then a few months later the machine
mysteriously hung, followed by a succession of kernel panics,
always with the same error message.

While I was trying to diagnose the problem, the mainboard stopped
responding to the keyboard and mouse, which led me to swap out the
mainboard. The kernel panics persisted. For a joke I swapped the
CPUs, and finally I pulled one of the memory sticks and bingo, the
machine was stable. I picked up a replacement stick, and now it is
working properly and I can't reproduce the cpulat errors either.

I've built probably 30+ PCs in my time and I have never seen this kind
of behaviour.

So the question that still remains for me is: why didn't the ECC error
check/recovery pick this up?


#2
May 19th 04, 01:15 PM
Bob Day

Despite the BIOS setting, perhaps the chipset or the mainboard does
not, in fact, support ECC memory.
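
One way to check this on a Linux box, assuming a kernel that includes the EDAC subsystem and has a driver loaded for the board's memory controller (which may well not be the case on a 2004-era install): the kernel exposes per-controller counts of corrected and uncorrected memory errors under /sys/devices/system/edac/mc/. A rough Python sketch follows; read_edac_counts is just a name made up for this example.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)}; {} if EDAC is absent."""
    counts = {}
    base = Path(root)
    if not base.is_dir():
        # No EDAC directory: no driver is reporting for this chipset.
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())  # corrected errors
            ue = int((mc / "ue_count").read_text())  # uncorrected errors
        except (OSError, ValueError):
            continue  # skip controllers whose counters cannot be read
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    found = read_edac_counts()
    if not found:
        print("No EDAC memory controllers reported by the kernel.")
    for name, (ce, ue) in found.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

An empty result suggests the operating system is not seeing any ECC reporting at all, whatever the BIOS screen says. A corrected-error count that keeps climbing would be an early hint that a stick is going bad.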

-- Bob Day


#3
May 24th 04, 07:37 AM
Erez Volach

ECC can detect almost any error, but can probably correct only a handful of
1-bit errors. If the data is too corrupt, the limited information provided by
the correction bits is not enough to recover it. I don't know what it would do
in such extreme situations. Perhaps that one stick of memory was not seated
well, and contact / impedance / resistance issues caused errors in data
transmission. Also, I would think the ECC works inside the memory banks /
modules, so it would not detect errors on the memory bus. That is a part
that is (or should be) handled by the north bridge of the motherboard.
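
To make the "limited information" point concrete, here is a toy sketch in Python. It is not necessarily the scheme this board's chipset uses. It assumes a SECDED code (single error correct, double error detect), which as far as I know is the usual arrangement for ECC DIMMs with 8 check bits per 64 data bits, scaled down here to 4 data bits and 4 check bits. The names encode, decode and flip are made up for the illustration.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1)."""
    code = [0] * 8  # index 0: overall parity; 1, 2, 4: Hamming parity
    # Data bits live at positions 3, 5, 6, 7 (least significant first).
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    for p in (1, 2, 4):
        # Parity bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, data); data is None when the error is uncorrectable."""
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # XOR of set positions is 0 for a valid word
    odd = sum(code) % 2  # 1 means an odd number of bits were flipped
    code = code[:]
    if syndrome == 0 and not odd:
        status = "ok"
    elif odd:
        # Decoder assumes exactly one flip; the syndrome names the position
        # (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        return "uncorrectable", None  # even number of flips: detect only
    data = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, data


def flip(code, *positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = code[:]
    for pos in positions:
        out[pos] ^= 1
    return out


if __name__ == "__main__":
    word = encode(0b1011)
    print(decode(word))                 # no damage
    print(decode(flip(word, 5)))        # one flipped bit
    print(decode(flip(word, 5, 6)))     # two flipped bits
    print(decode(flip(word, 3, 5, 6)))  # three flipped bits
```

In this toy code one flipped bit is repaired and two are flagged as uncorrectable, but three flipped bits look like a single-bit error, so the decoder "corrects" the word into wrong data without any warning. A stick that is bad enough to flip several bits in one word could slip past ECC in this way.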


 




All times are GMT +1.