December 28th 08, 01:04 PM, posted to comp.os.linux.hardware, alt.comp.hardware.overclocking
Paul
New release of sys_basher

Aragorn wrote:
On Sunday 28 December 2008 05:24, someone identifying as *General
Schvantzkoph* wrote in /comp.os.linux.hardwa/

I've put a new release of sys_basher on the web,

http://www.polybus.com/sys_basher_web/

sys_basher is a multi-threaded hardware exerciser, memory tester and
benchmarking tool. It will run on any Linux or Unix.


Does it also do diagnostics of other possibly failing hardware components
than memory? I'm just asking because I've got a machine sitting here idly
for quite some time now due to strange lockups and BIOS ECC error log
entries.

I had the machine checked by a tech but he couldn't find anything, and he
had run some benchmarking thing on it using an SQL database and had it
running like that for several days without - again, according to him - any
errors.

It was a rather expensive machine at the time, and I've only recently put in
a brand-new Adaptec 2130 SLP PCI-X U320 SCSI RAID controller and two equally
brand-new Hitachi 73 GB U320 SCSI disks. (The errors and crashes
predate that "transplant", though.)

The motherboard is an Intel server board - I think a 7500CW - with two Intel
Xeon 2.2 GHz (400 MHz FSB) 32-bit processors with hyperthreading. The
memory is 4 GB (4x 1 GB) Transcend ECC registered DDR-266, running at 200
MHz. /memtest86/ shows no errors whatsoever, and the power supply is 350
Watts but doesn't pull more than 220 Watts during boot-up and appears to
check out fine. The BIOS is a Phoenix, but don't ask me what release. ;-)

One of the strange things is that often during the Linux kernel boot
process, only three of the four hyperthreaded virtual CPUs are found -
occasionally only two, even. This "failure" is noticeable in advance
before the kernel actually starts displaying its boot messages by the delay
in switching from standard VGA resolution to the higher resolution
framebuffer, and there is a higher chance of this oddness occurring when
you press the /Enter/ key in the GRUB or LILO boot menu before the timeout
has expired. It then also shows strange messages like "booting processor
3/7", suggesting that the kernel sees eight processors, while it's a
two-socket motherboard with two hyperthreaded Xeons.

The machine has had this flaw from the beginning, but at the time it was
still rather exceptional, while by now it's rather exceptional to still
have it recognize all four virtual CPUs. It also used to fully lock up
without anything serious running, but the rate at which this would happen
was unpredictable. One time it would run for a whole week, the other time
it would lock up after only a few hours, or even earlier.

The machine has had Mandrake 10.0 PowerPack on it with a custom-built
vanilla 2.6.x kernel - various releases, starting with 2.6.5 and ending
with 2.6.17 or something - for many years but since it has gotten the newer
SCSI disks I've installed it with CentOS 5.1, as it was intended to be used
for our still very preliminary webhosting, and our hosting software
(DirectAdmin) only works with CentOS (or is only supported by the
developers to work with CentOS, anyway).

I'm mentioning all this because quite obviously everyone appears to be
stumped with regard to what could be wrong with this machine, and you come
across as somewhat of an Intel /connoisseur./ So maybe you've got some
clues? ;-)


There is a manual for the SE7500CW2 here.

http://download.intel.com/support/mo...500cw2/tps.pdf

http://support.intel.com/support/mot...ver/se7500cw2/

(Picture)
http://www.xeonchassis.com/images/IntelCW-OH.jpg

The design uses a shared VRM for both processors. The Intel document says
the processors should be matched, because both of them get their Vcore
from the same source. The VRM supports a total load of 130W, that is,
two 65W processors. The motherboard checks the VID from both processors
to make sure they're matched, so that should provide some protection
against a completely mismatched set of processors from a voltage
perspective.

You could concentrate on running a memory test. Or use multiple copies
of the Linux version of Prime95, as a means of doing an integrity test
on memory and processor. That should run the CPU at 100%, especially
if running four or more copies. (Prime95 is good, because it won't
be held back by disks or storage subsystems. A single math error and
it catches it.)

http://www.mersenne.org/freesoft/#newusers
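
To illustrate why this kind of test catches hardware faults, here is a
minimal Python sketch of the same idea. It is not Prime95 and is far
lighter than Prime95; it only repeats Lucas-Lehmer tests on small Mersenne
exponents whose results are known, in several worker processes, and reports
any answer that differs from the known one. The names check_once, the
worker count and the round count are assumptions made up for the example.

```python
import multiprocessing as mp

# Exponents p <= 127 (p odd) for which 2**p - 1 is a known Mersenne prime.
KNOWN_PRIME_EXPONENTS = {3, 5, 7, 13, 17, 19, 31, 61, 89, 107, 127}

# Odd prime exponents to test; the ones not listed above give composites.
TEST_EXPONENTS = [3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
                  53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103, 107,
                  109, 113, 127]


def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime (valid for odd prime p)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


def check_once(_=None):
    """Run every test once; return exponents whose result is wrong."""
    return [p for p in TEST_EXPONENTS
            if lucas_lehmer(p) != (p in KNOWN_PRIME_EXPONENTS)]


if __name__ == "__main__":
    workers = 4    # assumption: one worker per logical CPU
    rounds = 1000  # assumption: raise this for a longer run
    with mp.Pool(workers) as pool:
        results = pool.map(check_once, range(rounds))
    bad = [r for r in results if r]
    print("mismatches:", bad if bad else "none")
```

On healthy hardware every round should come back empty. A single wrong
result points at the processor or memory, not at the disks.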

In terms of strip-down procedures, you could try running with one
processor at a time, and see whether the symptoms are the same in
each case. (No terminator is needed in the empty second socket.)
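
To compare the single-processor runs on the same yardstick, you could log
how many logical CPUs the kernel actually found on each boot. The sketch
below counts the "processor" entries in /proc/cpuinfo. The expected value
is an assumption you set yourself: 2 with one hyperthreaded Xeon installed,
4 with both. The function names are made up for the example.

```python
def count_logical_cpus(cpuinfo_text):
    """Count 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        if sep and key.strip() == "processor":
            count += 1
    return count


def report(cpuinfo_text, expected):
    """Return a one-line summary comparing found vs. expected CPUs."""
    found = count_logical_cpus(cpuinfo_text)
    status = "OK" if found == expected else "MISSING"
    return f"{status}: kernel found {found} of {expected} logical CPUs"


if __name__ == "__main__":
    # Assumption: both hyperthreaded processors installed, so 4 expected.
    with open("/proc/cpuinfo") as f:
        print(report(f.read(), expected=4))
```

Run it after each boot and keep the output, so you can see whether the
missing CPUs follow one particular processor or one particular socket.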

With four sticks of RAM, you can also run just two sticks at a time,
and test them that way. (The board is dual channel, and the manual
claims a two stick minimum. It may run with just one stick,
but I don't immediately see that suggested in the manual.)

A 350W power supply could be a dual redundant 350W with load
sharing, or it could be a $20 single supply from Quickie-Mart.
You'd need to take a closer look at it to judge whether it
is adequate (the label has a bunch of limits printed on it).
In a quick web search, I see SE7500CW2 based machines shipping
with 450W supplies, to give you an idea what others use. But if
you have 20 disk drives in the box, obviously that requires
more beef.
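
As a rough way to use those label limits, the sketch below compares an
estimated current draw per rail against the limit printed on the label and
flags any rail running above a safety margin. Every number in it is a
placeholder, not a measurement of this machine or of any real supply, and
the 80% margin is only an assumption. Fill in the values from your own
label and your own estimates.

```python
def rails_over_limit(label_limits_a, estimated_load_a, margin=0.8):
    """Return {rail: (load, limit)} for rails loaded above margin * limit.

    Both arguments map a rail name to a current in amperes.
    """
    over = {}
    for rail, limit in label_limits_a.items():
        load = estimated_load_a.get(rail, 0.0)
        if load > margin * limit:
            over[rail] = (load, limit)
    return over


if __name__ == "__main__":
    # Placeholder values only; read the real limits off the supply label.
    label = {"+3.3V": 20.0, "+5V": 30.0, "+12V": 16.0}
    load = {"+3.3V": 8.0, "+5V": 12.0, "+12V": 15.0}
    flagged = rails_over_limit(label, load)
    for rail, (amps, limit) in flagged.items():
        print(f"{rail}: {amps} A estimated vs {limit} A on the label")
    if not flagged:
        print("no rail above the margin")
```

A supply can be fine on total wattage and still be short on a single rail,
which is why the per-rail numbers matter more than the headline figure.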

If you want help with hardware, it helps to start your own thread
about your hardware problems, and not hijack the General's thread :-)

Paul