#1
When is ECC not ECC?
I started getting instabilities on my homebuilt dual Athlon 2400 on an MSI K7D Master. I run Linux on the box and I started getting kernel panics. The system has two 512MB sticks of Samsung ECC memory and the BIOS settings are set to "Error Correct", so I thought that this would be adequate to deal with memory issues.

The failures started a while ago with a program I wrote called "cpulat", which measures the CPU-CPU memory latency. It would crash in unexpected, nonsensical places, and after trying to debug it for a while I just gave up. Then a few months later the machine mysteriously hung, followed by a succession of kernel panics, always with the same error message. In the process of trying to diagnose the problem, the mainboard stopped responding to the keyboard and mouse, which led to swapping out the mainboard. The kernel panics still persisted. On a whim I swapped the CPUs, and finally I pulled one of the memory sticks and bingo, the machine was now stable. I picked up a replacement stick and now it is working properly, and I can't reproduce the cpulat errors either.

I've built probably 30+ PCs in my time and I have never seen this kind of behaviour. So the question that still remains for me is: why didn't the ECC error recovery/check pick this up?
#2
"Gianni Mariani" wrote in message
... So the question that still remains for me is: why didn't the ECC error recovery/check pick this up?

Despite the BIOS setting, perhaps the chipset or the mainboard does not, in fact, support ECC memory.

-- Bob Day
#3
"Gianni Mariani" wrote in message
... So the question that still remains for me is: why didn't the ECC error recovery/check pick this up?

ECC can detect almost any error, but it can probably correct only a handful of 1-bit errors. If the data is too corrupt, ECC cannot recover it from the limited information provided by the correction bits. I don't know what it would do in such extreme situations.

Perhaps that one stick of memory was not seated well, and contact / impedance / resistance issues caused errors in data transmission. Also, I would think the ECC works inside the memory banks / modules, so it would not detect errors on the memory bus. That is a part that is (or should be) handled by the north bridge of the motherboard.
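
To make the point about limited correction bits concrete, here is a toy sketch of a single-error-correct, double-error-detect (SECDED) code. It is only an illustration: it protects 4 data bits with 4 check bits, whereas ECC DIMMs typically protect 64 data bits with 8 check bits, and nothing here is claimed to match what the K7D Master chipset actually does. The function names (`encode`, `decode`, `flip`) are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of 0/1).

    Positions 1..7 form a Hamming(7,4) code; position 0 is an overall
    parity bit that makes the total number of 1s even.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # overall parity
    return code


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


def decode(code):
    """Return (status, nibble), status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: the decoder ASSUMES exactly one and fixes it.
        # (syndrome 0 here means the overall parity bit itself flipped.)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        status = "uncorrectable"

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    for bad_bits in ([], [5], [2, 5], [1, 2, 4]):
        print(bad_bits, decode(flip(word, bad_bits)))
```

One flipped bit is repaired and two are flagged, but with three flipped bits this decoder reports "corrected" while handing back the wrong data. A stick that is bad enough to produce multi-bit errors could, under these assumptions, get past ECC without being noticed.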