#1
voltage stress and margin test of system stability
Another interesting experience with my Gigabyte GA-MA78GM-S2HP motherboard and AMD Phenom X3 8650: the system has locked up twice in the last four days, and I am contemplating some accelerated testing to find out what is going on. So far I have tried several programs from the Ultimate Boot CD (UBCD), such as memtest86+, cpuburn, and the Mersenne prime test, and all have passed several hours of testing (7+ hours in the memtest86+ case). Nevertheless, the system has locked up again, so I am wondering what I might do to provoke the failure on demand for debugging.

In chip testing, it is common to "margin" the chip by essentially turning down the supply voltage until the chip starts failing in an obvious and frequent manner. My question: what is the collective experience with such tests at the system and motherboard level? One problem I run into is that the BIOS only seems to permit turning the voltages (CPU, memory, ...) UP rather than down. I suppose I could also stress the system by overclocking it, but somehow I'd feel more convinced if I could do some voltage margin testing. Any ideas or experiences that pertain to this matter?
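One low-tech way to bound *when* a hard lockup happens is a disk-synced heartbeat: the last timestamp that made it to disk tells you roughly when the machine died, and whether it died during the stress run or while idle. A minimal sketch (the log path and interval are arbitrary choices, not anything from UBCD):

```python
import os
import time

def heartbeat(path, interval=1.0, iterations=None):
    """Append a timestamp to `path` every `interval` seconds,
    flushing and fsync'ing so the last entry survives a hard lockup."""
    count = 0
    with open(path, "a") as f:
        while iterations is None or count < iterations:
            f.write("%.3f\n" % time.time())
            f.flush()
            os.fsync(f.fileno())  # force the line through the OS cache to disk
            count += 1
            if iterations is None or count < iterations:
                time.sleep(interval)
```

After a freeze and reboot, the gap between the final logged timestamp and the reboot time narrows the failure window considerably.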
#4
On Feb 3, 2:01 am, Paul wrote:
wrote: [original post from #1 snipped]

Creating the voltage-versus-frequency curve is what overclockers (or underclockers) do. For example, on my latest purchase I know that an extra 0.1 V on Vcore allows a 33% overclock. By proceeding in small steps of frequency, and adjusting Vcore for the same level of stability at each test point, you can produce your own voltage-versus-frequency curve. On an older processor (Northwood) I got to see the "brick wall" pattern, where past a certain point all the extra (safe) voltage that could be applied didn't allow any higher overclock.

In terms of features, AMD and Intel have Cool'n'Quiet (CNQ) and Enhanced SpeedStep (EIST). If these are enabled then, depending on OS loading, the voltage and frequency are changed dynamically, up to 30 times per second. The multiplier might vary between 6X and 9X, say, with some small difference in Vcore applied to those two conditions, according to the manufacturer's declaration of what is enough to make it work. So if you are having stability issues, your first step is to disable CNQ or EIST. The purpose is not to blame those features for the stability issue (they're not likely to be the problem), but to make the test conditions a stable, known quantity. You want just one frequency involved in a test case, since you're attempting characterization.

On my processor, I believe the Vcore setting is policed by the processor. My Core2 has VID bits to drive the Vcore regulator, and using tools that can control the multiplier and drive out new Vcore values while the system is running, the processor seems to have an upper limit on what bit pattern it will pass on the VID bits. That prevents any useful level of overvolting on my newest system. Previous generations used things like overclock controller chips to allow "in-band" VID changes. On some motherboards you may notice the nomenclature "+0.1V" for a Vcore setting, rather than a more direct "1.300V" setting in the BIOS. I interpret this to mean the motherboard design has a feature to bump Vcore independent of the VID bits, i.e. an offset applied in the Vcore regulator. I had to do something similar to my motherboard with a soldering iron: I now have a socket where I can fit a 1/4 W resistor, and by varying the value I get a voltage boost. My motherboard, unlike some other brands, offers no out-of-band voltage boost feature, so I implemented my own, using instructions from other users who did the analysis before me. You likely won't have to go through this; I mention it in case you cannot reconcile what you see while testing (the setting says one thing, the measured value is another). If the set value and the measured value don't match, part of the difference is due to "droop", and part can be a boost applied independent of the VID bits.

As Kony says, a driver could be responsible for the problem. The Prime95 Mersenne test is pretty good at finding bad RAM, and since you've run that for a few hours, that helps eliminate bad memory. Prime95 can only test the memory that is separate from the portion used by the OS, so some areas of RAM may not have been tested as thoroughly. Other things that might freeze the system: a misadjusted bus multiplier, like the one used for HyperTransport between processor and Northbridge, or a SATA or IDE clock too far from nominal. So clock signals to other hardware in the system could give a freezing symptom; data or some transaction to the processor could be frozen while the processor itself is still running. Another comment: on my older overclocking test projects the processor would crash on an error, while my current Core2 system tends to freeze rather than give an old-fashioned blue screen. So there can be differences from one generation to another as to what part of the processor fails, and whether the system runs long enough to splatter something across the screen.

    Paul

Just a quick follow-up to some of the questions and comments:
--It is a Linux system.
--When locked, the machine as a whole is locked, not just the window system. For example, the machine does not respond to a ping.
--No log records the error, AFAICT.
--The UBCD test programs run from a CD that boots a custom Linux kernel. Not sure whether the dynamic frequency scaling module (cpufreq_ondemand) is enabled, whether all cores get exercised simultaneously, etc. This needs to be investigated.
--To be certain, I'll disable Cool'n'Quiet next time I boot.
--Overclocking versus voltage margining: doing both (two dimensions) generates what is called a "shmoo plot" in chip parlance. I was hoping to do voltage margining, but it looks like I may have to do frequency margining instead. I also have only +0.1V steps available, which is rather coarse.
--As a rule, if a chip is spec'ed at X volts, there is generally a +/-Y% margin also specified, because no system can guarantee an exact voltage. Chips are often spec'ed at +/-5% or +/-10%, although processors may have tighter specs; I do not know.

I will follow some of the suggestions and see what I can find. My main problem is that it can take days to provoke the failure, hence my desire for additional and fine-grained stress.
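The shmoo plot mentioned above is just a pass/fail grid over two stress axes, here frequency and Vcore. A toy rendering of one, with a completely made-up pass criterion purely for illustration (real data points would come from the actual stress runs):

```python
def shmoo(freqs, voltages, passes):
    """Render a pass/fail grid: rows = voltage, columns = frequency.
    `passes(f, v)` returns True if the part works at that point."""
    rows = []
    for v in sorted(voltages, reverse=True):  # highest voltage on top
        cells = "".join("+" if passes(f, v) else "." for f in sorted(freqs))
        rows.append("%5.2f V |%s" % (v, cells))
    return "\n".join(rows)

# Toy criterion: each extra MHz of FSB needs 5 mV more Vcore (invented numbers).
# The 1e-9 slack guards against float rounding in the comparison.
demo = shmoo(freqs=[200, 220, 240, 260],
             voltages=[1.1, 1.2, 1.3],
             passes=lambda f, v: v + 1e-9 >= 1.1 + (f - 200) * 0.005)
print(demo)
```

The characteristic shmoo shape is exactly the staircase of "+" cells this prints: more voltage buys more frequency until the brick wall appears.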
#5
On Feb 3, 9:49 am, wrote:
[previous exchange with Paul snipped]

Okay, I did some frequency margin tests by stepping the FSB up from 200 MHz in 5 MHz increments. The CPU frequency was 11.5x the FSB and the memory frequency was 3 1/3 x the FSB. All voltages at nominal levels:
--POST passed up to 245 MHz.
--Boot (Linux) passed up to 240 MHz.
--Memtest is currently running at 245 MHz with no errors in 27 minutes.
--CNQ is *off*.

It may be the graphics driver, then, as Kony was saying. At least I feel more comfortable about the hardware at this point, having seen a 20% frequency margin before any large-scale failures (at nominal voltage settings).
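For reference, the derived clocks at each FSB step follow directly from the multipliers quoted above; a quick sketch (ratios taken from the post, nominal values only):

```python
def derived_clocks(fsb_mhz, cpu_mult=11.5, mem_mult=10.0 / 3.0):
    """CPU and memory clocks implied by an FSB setting.
    Multipliers are the ones stated in the post (11.5x CPU, 3 1/3 x memory)."""
    return {"fsb": fsb_mhz,
            "cpu": fsb_mhz * cpu_mult,
            "mem": fsb_mhz * mem_mult}

# The 5 MHz steps tried in the post, from stock 200 MHz up to 245 MHz
for fsb in range(200, 250, 5):
    c = derived_clocks(fsb)
    print("FSB %3d MHz -> CPU %6.1f MHz, memory %5.1f MHz" % (fsb, c["cpu"], c["mem"]))
```

At stock this gives CPU 2300 MHz (the Phenom X3 8650's rated 2.3 GHz) and memory 666.7 MHz; at the 240 MHz limit the CPU is running at 2760 MHz, a 20% margin over nominal.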
#6
wrote: [original post from #1 snipped]

I've seen many memory modules pass MemTest86+ but fail MemTest86. Similarly, many modules passed GoldMemory ver. 6.92 but failed ver. 5.07. OTOH, every module I've tested that failed GM ver. 5.07 eventually failed MT86 ver. 3.xx, and vice versa.
#7
On Feb 4, 5:12 pm, "larry moe 'n curly" wrote:
[previous post snipped]

That is interesting information. I have a somewhat superficial knowledge of memory testing, pattern sensitivities and such, and I wonder what the difference between the programs is, especially considering that the NEWER versions appear to be less stressful than the older versions in some cases. On a related note, memory errors are sometimes (perhaps often?) transient. Do any of the programs keep a saved bad-address list, so that one can go back and retest the specific addresses (or regions) where the failure occurred? At least some of the programs run very much standalone, with little OS support...
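None of the standalone testers I know of advertises a persistent bad-address list, but the bookkeeping itself is small. A sketch of the idea (the file name, JSON format, and `record_failures` helper are all invented for illustration; a real tester would track physical addresses):

```python
import json
import os

def record_failures(logfile, new_bad):
    """Merge newly failing addresses into a persistent bad-address list,
    keeping a hit count per address so repeatable failures stand out
    from transient ones across runs."""
    hits = {}
    if os.path.exists(logfile):
        with open(logfile) as f:
            hits = json.load(f)
    for addr in new_bad:
        key = "0x%x" % addr
        hits[key] = hits.get(key, 0) + 1  # count repeat offenders
    with open(logfile, "w") as f:
        json.dump(hits, f)
    return hits
```

On a later retest pass you would hammer only the addresses whose counts keep climbing, which is exactly the transient-versus-repeatable distinction raised above.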
#9
On Fri, 06 Feb 2009 06:41:59 -0800, spamme0 wrote:

Some questions can't be answered with technology you can afford. Even if it is a voltage problem, you won't be able to measure it with a voltmeter. You'll need a VERY fast digital storage scope and a set of probes you'll have to mortgage your house to buy. It'll be a voltage droop during a DMA transfer, while the disk is seeking and the video memory crosses a certain address and all the address lines change at once... on Tuesday when the moon is full. You do this kind of debugging on prototypes and subsystems. For failed customer units, you throw them away.

Considering the impedance and capacitance in the circuits involved, a relatively slow, cheap scope would find such a problem; even a good multimeter with a high/low hold feature probably would.
#10
On Feb 6, 6:41 am, spamme0 wrote:
wrote: [original post from #1 snipped]

A wise mentor once told me the secret to debugging. I'll pass it on to you if you promise to keep it a secret. Every time you type a question mark, stop and ponder the question. EXACTLY what are you going to do with the answer? Plot yourself a decision tree, if only in your head. If the answer is yes, I'm gonna do this. If it's no, I'll do that. If it's 3.4, I'm gonna do the other thing. After you've done this for a while, it will become obvious that most questions (tests done) don't need to be asked. If you're gonna do the same thing no matter what the answer, skip it and move on. Another thing that happens is that most of the branches lead nowhere. If you can't hypothesize a set of results leading to something you can actually fix, you need a new plan. If a set of answers leads nowhere, you don't need any of the intermediate results. Pondering the range of possible answers to your question leads to much more efficient debugging. This process will give you better-than-average debugging results... but you won't find the cure for cancer with this strategy.

So, back to your question... You turn down the volts and it fails. Now what? Can you be sure it's the same failure? How much lower is enough lower? And what are you going to do to fix it? Some questions can't be answered with technology you can afford. Even if it is a voltage problem, you won't be able to measure it with a voltmeter. You'll need a VERY fast digital storage scope and a set of probes you'll have to mortgage your house to buy. It'll be a voltage droop during a DMA transfer, while the disk is seeking and the video memory crosses a certain address and all the address lines change at once... on Tuesday when the moon is full. You do this kind of debugging on prototypes and subsystems. For failed customer units, you throw them away.

Point taken. In my case I was trying to determine whether it really was bad hardware or (as some suggested) a software bug. Right now I am leaning toward the latter, as the system has not frozen since I upgraded the video driver (knock on wood).