#1
voltage stress and margin test of system stability
Another interesting experience with my Gigabyte GA-MA78GM-S2HP motherboard and AMD Phenom X3 8650: the system has locked up twice in the last four days, and I am contemplating some accelerated testing to find out what is going on. So far I have tried several programs from the Ultimate Boot CD (UBCD), such as memtest86+, cpuburn, and the Mersenne prime test, and all have passed several hours of testing (7+ hours in the memtest86+ case). Nevertheless, the system has locked up again, so I am wondering what I might do to provoke the failure on demand for debugging.

In chip testing, it is common to "margin" the chip by essentially turning down the supply voltage until the chip starts failing in an obvious and frequent manner. My question: what is the collective experience with such tests at the system and motherboard level? One problem I run into is that the BIOS only seems to permit turning the voltages (CPU, memory, ...) UP rather than down. I suppose I could also stress the system by overclocking it, but somehow I'd feel more convinced if I could do some voltage margin testing. Any ideas or experiences that pertain to this matter?
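One low-tech way to bound *when* a hard lockup happens is a disk-synced heartbeat: the last timestamp that made it to disk tells you roughly when the machine died, and whether it died during the stress run or while idle. A minimal sketch (the log path and interval are arbitrary choices, not anything from UBCD):

```python
import os
import time

def heartbeat(path, interval=1.0, iterations=None):
    """Append a timestamp to `path` every `interval` seconds,
    flushing and fsync'ing so the last entry survives a hard lockup."""
    count = 0
    with open(path, "a") as f:
        while iterations is None or count < iterations:
            f.write("%.3f\n" % time.time())
            f.flush()
            os.fsync(f.fileno())  # force the line through the OS cache to disk
            count += 1
            if iterations is None or count < iterations:
                time.sleep(interval)
```

After a freeze and reboot, the gap between the final logged timestamp and the reboot time narrows the failure window considerably.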
#4
On Feb 3, 2:01 am, Paul wrote:
wrote: [original post from #1 snipped]

Creating the voltage-versus-frequency curve is what overclockers (or underclockers) do. For example, on my latest purchase I know that an extra 0.1 V on Vcore allows a 33% overclock. By proceeding in small steps of frequency, and adjusting Vcore for the same level of stability at each test point, you can produce your own voltage-versus-frequency curve. On an older processor (Northwood) I got to see the "brick wall" pattern, where past a certain point all the extra (safe) voltage that could be applied didn't allow any higher overclock.

In terms of features, AMD and Intel have Cool'n'Quiet (CNQ) and Enhanced SpeedStep (EIST). If these are enabled then, depending on OS loading, the voltage and frequency are changed dynamically, up to 30 times per second. The multiplier might vary between 6X and 9X, say, with some small difference in Vcore applied to those two conditions, according to the manufacturer's declaration of what is enough to make it work. So if you are having stability issues, your first step is to disable CNQ or EIST. The purpose is not to blame those features for the stability issue (they're not likely to be the problem), but to make the test conditions a stable, known quantity. You want just one frequency involved in a test case, since you're attempting characterization.

On my processor, I believe the Vcore setting is policed by the processor. My Core2 has VID bits to drive the Vcore regulator, and using tools that can control the multiplier and drive out new Vcore values while the system is running, the processor seems to have an upper limit on what bit pattern it will pass on the VID bits. That prevents any useful level of overvolting on my newest system. Previous generations used things like overclock controller chips to allow "in-band" VID changes. On some motherboards you may notice the nomenclature "+0.1V" for a Vcore setting, rather than a more direct "1.300V" setting in the BIOS. I interpret this to mean the motherboard design has a feature to bump Vcore independent of the VID bits, i.e. an offset applied in the Vcore regulator. I had to do something similar to my motherboard with a soldering iron: I now have a socket where I can fit a 1/4 W resistor, and by varying the value I get a voltage boost. My motherboard, unlike some other brands, offers no out-of-band voltage boost feature, so I implemented my own, using instructions from other users who did the analysis before me. You likely won't have to go through this; I mention it in case you cannot reconcile what you see while testing (the setting says one thing, the measured value is another). If the set value and the measured value don't match, part of the difference is due to "droop", and part can be a boost applied independent of the VID bits.

As Kony says, a driver could be responsible for the problem. The Prime95 Mersenne test is pretty good at finding bad RAM, and since you've run that for a few hours, that helps eliminate bad memory. Prime95 can only test the memory that is separate from the portion used by the OS, so some areas of RAM may not have been tested as thoroughly. Other things that might freeze the system: a misadjusted bus multiplier, like the one used for HyperTransport between processor and Northbridge, or a SATA or IDE clock too far from nominal. So clock signals to other hardware in the system could give a freezing symptom; data or some transaction to the processor could be frozen while the processor itself is still running. Another comment: on my older overclocking test projects the processor would crash on an error, while my current Core2 system tends to freeze rather than give an old-fashioned blue screen. So there can be differences from one generation to another as to what part of the processor fails, and whether the system runs long enough to splatter something across the screen.

    Paul

Just a quick follow-up to some of the questions and comments:
--It is a Linux system.
--When locked, the machine as a whole is locked, not just the window system. For example, the machine does not respond to a ping.
--No log records the error, AFAICT.
--The UBCD test programs run from a CD that boots a custom Linux kernel. Not sure whether the dynamic frequency scaling module (cpufreq_ondemand) is enabled, whether all cores get exercised simultaneously, etc. This needs to be investigated.
--To be certain, I'll disable Cool'n'Quiet next time I boot.
--Overclocking versus voltage margining: doing both (two dimensions) generates what is called a "shmoo plot" in chip parlance. I was hoping to do voltage margining, but it looks like I may have to do frequency margining instead. I also have only +0.1V steps available, which is rather coarse.
--As a rule, if a chip is spec'ed at X volts, there is generally a +/-Y% margin also specified, because no system can guarantee an exact voltage. Chips are often spec'ed at +/-5% or +/-10%, although processors may have tighter specs; I do not know.

I will follow some of the suggestions and see what I can find. My main problem is that it can take days to provoke the failure, hence my desire for additional and fine-grained stress.
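The shmoo plot mentioned above is just a pass/fail grid over two stress axes, here frequency and Vcore. A toy rendering of one, with a completely made-up pass criterion purely for illustration (real data points would come from the actual stress runs):

```python
def shmoo(freqs, voltages, passes):
    """Render a pass/fail grid: rows = voltage, columns = frequency.
    `passes(f, v)` returns True if the part works at that point."""
    rows = []
    for v in sorted(voltages, reverse=True):  # highest voltage on top
        cells = "".join("+" if passes(f, v) else "." for f in sorted(freqs))
        rows.append("%5.2f V |%s" % (v, cells))
    return "\n".join(rows)

# Toy criterion: each extra MHz of FSB needs 5 mV more Vcore (invented numbers).
# The 1e-9 slack guards against float rounding in the comparison.
demo = shmoo(freqs=[200, 220, 240, 260],
             voltages=[1.1, 1.2, 1.3],
             passes=lambda f, v: v + 1e-9 >= 1.1 + (f - 200) * 0.005)
print(demo)
```

The characteristic shmoo shape is exactly the staircase of "+" cells this prints: more voltage buys more frequency until the brick wall appears.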
#5
On Feb 3, 9:49 am, wrote:
[previous exchange with Paul snipped]

Okay, I did some frequency margin tests by stepping the FSB up from 200 MHz in 5 MHz increments. The CPU frequency was 11.5x the FSB and the memory frequency was 3 1/3 x the FSB. All voltages at nominal levels:
--POST passed up to 245 MHz.
--Boot (Linux) passed up to 240 MHz.
--Memtest is currently running at 245 MHz with no errors in 27 minutes.
--CNQ is *off*.

It may be the graphics driver, then, as Kony was saying. At least I feel more comfortable about the hardware at this point, having seen a 20% frequency margin before any large-scale failures (at nominal voltage settings).
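For reference, the derived clocks at each FSB step follow directly from the multipliers quoted above; a quick sketch (ratios taken from the post, nominal values only):

```python
def derived_clocks(fsb_mhz, cpu_mult=11.5, mem_mult=10.0 / 3.0):
    """CPU and memory clocks implied by an FSB setting.
    Multipliers are the ones stated in the post (11.5x CPU, 3 1/3 x memory)."""
    return {"fsb": fsb_mhz,
            "cpu": fsb_mhz * cpu_mult,
            "mem": fsb_mhz * mem_mult}

# The 5 MHz steps tried in the post, from stock 200 MHz up to 245 MHz
for fsb in range(200, 250, 5):
    c = derived_clocks(fsb)
    print("FSB %3d MHz -> CPU %6.1f MHz, memory %5.1f MHz" % (fsb, c["cpu"], c["mem"]))
```

At stock this gives CPU 2300 MHz (the Phenom X3 8650's rated 2.3 GHz) and memory 666.7 MHz; at the 240 MHz limit the CPU is running at 2760 MHz, a 20% margin over nominal.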
#6
wrote: [original post from #1 snipped]

I've seen many memory modules pass MemTest86+ but fail MemTest86. Similarly, many modules passed GoldMemory ver. 6.92 but failed ver. 5.07. OTOH, every module I've tested that failed GM ver. 5.07 eventually failed MT86 ver. 3.xx, and vice versa.
#7
On Feb 4, 5:12 pm, "larry moe 'n curly" wrote:
[previous post snipped]

That is interesting information. I have a somewhat superficial knowledge of memory testing, pattern sensitivities and such, and I wonder what the difference between the programs is, especially considering that the NEWER versions appear to be less stressful than the older versions in some cases. On a related note, memory errors are sometimes (perhaps often?) transient. Do any of the programs keep a saved bad-address list, so that one can go back and retest the specific addresses (or regions) where the failure occurred? At least some of the programs run very much standalone, with little OS support...
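None of the standalone testers I know of advertises a persistent bad-address list, but the bookkeeping itself is small. A sketch of the idea (the file name, JSON format, and `record_failures` helper are all invented for illustration; a real tester would track physical addresses):

```python
import json
import os

def record_failures(logfile, new_bad):
    """Merge newly failing addresses into a persistent bad-address list,
    keeping a hit count per address so repeatable failures stand out
    from transient ones across runs."""
    hits = {}
    if os.path.exists(logfile):
        with open(logfile) as f:
            hits = json.load(f)
    for addr in new_bad:
        key = "0x%x" % addr
        hits[key] = hits.get(key, 0) + 1  # count repeat offenders
    with open(logfile, "w") as f:
        json.dump(hits, f)
    return hits
```

On a later retest pass you would hammer only the addresses whose counts keep climbing, which is exactly the transient-versus-repeatable distinction raised above.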
#9
On Fri, 06 Feb 2009 06:41:59 -0800, spamme0 wrote:

Some questions can't be answered with technology you can afford. Even if it is a voltage problem, you won't be able to measure it with a voltmeter. You'll need a VERY fast digital storage scope and a set of probes you'll have to mortgage your house to buy. It'll be a voltage droop during a DMA transfer, while the disk is seeking and the video memory crosses a certain address and all the address lines change at once... on Tuesday when the moon is full. You do this kind of debugging on prototypes and subsystems. For failed customer units, you throw them away.

Considering the impedance and capacitance in the circuits involved, a relatively slow, cheap scope would find such a problem; even a good multimeter with a high/low hold feature probably would.
#10
On Feb 6, 6:41 am, spamme0 wrote:
wrote: [original post from #1 snipped]

A wise mentor once told me the secret to debugging. I'll pass it on to you if you promise to keep it a secret. Every time you type a question mark, stop and ponder the question. EXACTLY what are you going to do with the answer? Plot yourself a decision tree, if only in your head. If the answer is yes, I'm gonna do this. If it's no, I'll do that. If it's 3.4, I'm gonna do the other thing. After you've done this for a while, it will become obvious that most questions (tests done) don't need to be asked. If you're gonna do the same thing no matter what the answer, skip it and move on. Another thing that happens is that most of the branches lead nowhere. If you can't hypothesize a set of results leading to something you can actually fix, you need a new plan. If a set of answers leads nowhere, you don't need any of the intermediate results. Pondering the range of possible answers to your question leads to much more efficient debugging. This process will give you better-than-average debugging results... but you won't find the cure for cancer with this strategy.

So, back to your question... You turn down the volts and it fails. Now what? Can you be sure it's the same failure? How much lower is enough lower? And what are you going to do to fix it? Some questions can't be answered with technology you can afford. Even if it is a voltage problem, you won't be able to measure it with a voltmeter. You'll need a VERY fast digital storage scope and a set of probes you'll have to mortgage your house to buy. It'll be a voltage droop during a DMA transfer, while the disk is seeking and the video memory crosses a certain address and all the address lines change at once... on Tuesday when the moon is full. You do this kind of debugging on prototypes and subsystems. For failed customer units, you throw them away.

Point taken. In my case I was trying to determine whether it really was bad hardware or (as some suggested) a software bug. Right now I am leaning toward the latter, as the system has not frozen since I upgraded the video driver (knock on wood).