View Full Version : Re: Memtest86 errors


Boudewijn
August 2nd 03, 06:58 PM
Hello,
It could be that you need to relax the timings a bit with the second stick in.
Are both sticks running fine separately, i.e. no errors in Memtest?
If so, try relaxing the timings and look again.
Regards, Boudewijn


"Bob Davis" > wrote in message
...
> System: Gigabyte 8KNXP, P4 2.8C, 2gb RAM (Kingston DDR400
> KVR400X64C3A/512), not overclocked.
>
> When running Memtest86 v3 I get random errors at the very end of test #5.
> The "good" is always one of the following: ff7fffff, ffefffff, fff7ffff,
> or fffeffff--and the "bad" is always ffffffff.
>
> A second GB of RAM was installed two days ago, and the errors started at
> that time. Previous tests with 1gb did not manifest errors. The address
> location of the errors is random throughout the 2gb, so it apparently
> isn't in the new modules. Error count can be as low as two and as high
> as 20, and they always occur just as Test #5 is completing and
> transitioning to Test #6. I reversed the two new modules without effect,
> but since I use a Zalman HSF and the Bank 0, Slot 1 module appears not to
> be removable without moving or removing the HSF, I haven't tried the new
> memory alone.
>
> Anyone have an idea what's going on here? The computer runs fine and I've
> had no trouble yet in real-world experience.
>
>

Bob Davis
August 2nd 03, 11:42 PM
"Frank" > wrote in message
m...

> When memtest86 shows errors something is bad. Now the job
> is to find out what is bad. Which stick or even the CPU. Read
> the documentation on their site.


See the quote from the Memtest86 site I sent in reply to Boudewijn. My
problem sounds like what's happening in this description.

Tim
August 3rd 03, 01:10 AM
Bob,

If memtest86 were referencing memory that did not exist, surely all the
tests would report errors in the same address range.

What I think you are getting are bit position memory errors. I'll explain it
briefly - as an engineer once explained it to me.

If memory is arranged in a straight array (it is not necessarily), then a
bit error can occur when adjacent bits are set due to manufacturing flaws.

E.g.:

11111011 - what is written
11111111 - what is read.

OR

00000100 - what is written
00000000 - what is read


Where the memory cells are adjacent, a write to neighbouring bits can cause
a flip of a bit that is supposed to keep its value. The flip can occur from
0 to 1 or from 1 to 0.

Each cell (bit) in a DRAM (dynamic random access memory, or DRAM-derived
memory) is simply a tiny capacitor that stores charge. (SRAM uses
transistors and transistor logic to store bits, so it takes considerably
more circuitry to store one bit.) A positive charge in the cell (capacitor)
may indicate a 1, and a negative may indicate a zero. Since the cells are
capacitors and are extremely tiny, the charge dissipates quickly, eventually
coming to zero if the cell is not refreshed with the correct charge in time.

The process of doing a read in DRAM is destructive. I.e., rather than have
complex electronics that measure the now-low voltage of the cell to see if
it is slightly + or slightly -, the read process does a write of either a +
or a - charge (depends on design). The result is either a clearly readable
big + or a value clearly nearer to zero. I.e., if the cell contained a + and
a + is written, then a + is clearly there afterwards. If the cell contains a
- and a + is written, the same + won't be there, only a partial + will be.
So, if the result is not clearly a +, the logic puts a minus *back* in to
correct things, refreshing the correct charge, and reports a minus, which
may mean a zero or a one depending on design.

In DRAM, the memory has to be refreshed very regularly; this is part of the
reason for so many odd-looking timing numbers.

During a read process (which is a refresh-write) or a write process, if
there are manufacturing flaws (e.g. two conductors or two capacitors touch,
or the capacitors are not insulated from each other well enough), an
adjacent bit can get the same treatment during the refresh although it is
not being addressed: a Bit Flip.

You seem (I say Seem) to be getting bit flips.
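
To see which bit is flipping, you can XOR the value that was written with the value that was read back; the 1 bits in the result are the positions that changed. Here is a minimal Python sketch of that idea, using the "good"/"bad" values from the original post. The function name flipped_bits is made up for illustration and has nothing to do with how Memtest86 itself is written.

```python
from typing import List


def flipped_bits(expected: int, actual: int) -> List[int]:
    """Return the bit positions (0 = least significant) where the values differ."""
    diff = expected ^ actual  # XOR leaves a 1 wherever the two values disagree
    return [pos for pos in range(diff.bit_length()) if (diff >> pos) & 1]


# "Good" values quoted in the original post; the "bad" value was always ffffffff.
GOOD_VALUES = [0xFF7FFFFF, 0xFFEFFFFF, 0xFFF7FFFF, 0xFFFEFFFF]
BAD_VALUE = 0xFFFFFFFF

for good in GOOD_VALUES:
    print(f"{good:08x} vs {BAD_VALUE:08x}: differing bits {flipped_bits(good, BAD_VALUE)}")
```

For those four values, each pair differs in exactly one bit (23, 20, 19 and 16 respectively), i.e. a single 0 that was read back as a 1.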

Memtest86 is designed to look for bit flips - this is why it writes patterns
like '11111110', '11111101', '11111011' etc as well as '10000000',
'01000000', '00100000' etc. to provoke bit flip errors.
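
Those two pattern families are easy to generate yourself if you want to see them written out. The sketch below is only an illustration of the patterns, 8 bits wide as in the examples above; the helper names walking_zeros and walking_ones are my own, and this is not a claim about Memtest86's actual code.

```python
from typing import List


def walking_zeros(width: int = 8) -> List[int]:
    """All ones with a single 0 that moves from the lowest bit upwards."""
    mask = (1 << width) - 1
    return [mask ^ (1 << pos) for pos in range(width)]


def walking_ones(width: int = 8) -> List[int]:
    """All zeros with a single 1 that moves from the highest bit downwards."""
    return [1 << pos for pos in reversed(range(width))]


for pattern in walking_zeros():
    print(format(pattern, "08b"))
for pattern in walking_ones():
    print(format(pattern, "08b"))
```

The point of isolating a single bit like this is that it is surrounded by neighbours holding the opposite value, which is the situation most likely to provoke a flip if adjacent cells are leaking into each other.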

If you can move any of the new memory modules around and the reported
*addresses* move with them, then the flaw is in the chip. If they don't
move, then given what you say about your memory timings being quite relaxed,
I am not too sure; perhaps it is a mobo problem. At that point I would either
return the memory or gain further evidence by putting it into the first bank
to prove what is going on.

If the errors are quite consistent and move, then I would be very tempted to
say from this distance that the memory is actually stuffed.

- Tim



Bob Davis
August 3rd 03, 06:11 AM
"Tim" > wrote in message ...

> 11111011 - what is written
> 11111111 - what is read.
>
> OR
>
> 00000100 - what is written
> 00000000 - what is read


Thanks for that thoughtful and informative explanation. This is what's
happening in every case. My most recent test showed two iterations of a
00000100 written and 00000000 read, which is the first occurrence outside
the ff....ff range. This indicated 100 bit errors, while the f.......
series always shows either 10000 or 20000 bit errors. After reading the
author's explanation quoted in a previous message, I'm leaning toward his
explanation of software confusion, especially after seeing the results of
the tests I've listed below.

> You seem (I say Seem) to be getting bit flips.
>
> Memtest86 is designed to look for bit flips - this is why it writes
> patterns like '11111110', '11111101', '11111011' etc as well as
> '10000000', '01000000', '00100000' etc. to provoke bit flip errors.
>
> If you can move any of the new memory cards around then if the *addresses*
> reported move, then the flaw is in the chip. If it doesn't move - given
> what you say about your memory timings being quite relaxed, then I am not
> too sure - perhaps a mobo problem. At that point I would either return the
> memory or gain further evidence and put it into the first bank to prove
> what is going on.


The error addresses are different each time the test is run even when I
don't move the modules around.

Keep in mind that I have six slots on this mobo. I first swapped the new
modules (Slot 2 to Slot 5 and vice versa), and that revealed no change.
Then I changed the memory timings in the bios from 3-8-3-3 (SPD) to
2-7-3-3, which manifested fewer errors, and while the "bad" were always
ffffffff before, the "good" were then ffffffff, a reversal of position.
Later with this arrangement the order swapped again back to original. I
then changed the timings to 3-10-4-4, which is about as relaxed as my bios
allows, and that revealed no change.

Next, I removed the old memory and installed only the new in Slot 1 and Slot
4, and received no errors. So, while the two pairs used separately show no
errors while in the first slot of each bank, they manifest these random
errors on Test 5 when used together. This indicates to me that the physical
RAM is not defective.

That naturally led me to think something might be wrong with the second
slot of one or both banks (Slots 2 and/or 5), so I removed the sticks in
Slots 1 and 3 and that manifested no errors. That should eliminate the
possibility of bad slots. So if the slots and modules are good, other than
a software issue I'm dumbfounded as to what could be causing the errors.