SCSI RAID problems


Andrew Wasielewski
September 21st 04, 12:23 AM
Hello SCSI friends,

I am having some strange/worrying problems with my SCSI RAID setup. Perhaps someone can help?

I have a Gigabyte GA-KNXP Ultra m/b with an embedded Adaptec AIC-7902 U320 SCSI controller. I have a RAID 10 array made up of 4 x Maxtor Atlas IV 10k 36GB disks (SCSI IDs 3, 4, 5 & 6) on Channel A. The 72GB of usable space is partitioned into a 4GB user partition + 4 others for paging file, apps & data (+ some free space left over). I run WinXP Pro SP1.

A little while ago I started to get signs of hardware problems:
a. Errors in the Windows Event Log (ID 9) saying "The device, \Device\Scsi\a320raid1, did not respond within the timeout period." A few times the system crashed at about the same time;
b. Nasty "scrunching" sounds coming from the disks;
c. In the SCSI BIOS the array is tagged "Degraded". On drilling down, the disk on ID 4 is tagged "Degraded"; all the others are "Optimal".

After a short while the errors & noise go away, so I assume the disk has finally died & prepare to RMA it. In the interim I get a different error in the event log (ID 7) saying "The device, \Device\Harddisk0\D, has a bad block." whenever (and only when) I try to back up drive C: to the Tandberg SLR7 tape drive using the WinXP Backup utility. (I get error ID 14 saying "The shadow copy of volume C: was aborted because of an IO failure." at the same time.) I can't find out much about these errors, other than that the first one usually means disk failure is imminent; however, as the O/S can only see the logical disk, not the individual physical devices, I put this down to a glitch from the disk failure. Apart from this, all apps appear to work normally as far as I can tell.

When I get the replacement disk I replace the "Degraded" one using the same SCSI ID & attempt a rebuild. However, this fails with the message "Due to read error on device ID 3". When I run Verify Media on disk ID 3 it finds 3 bad blocks; it says it has remapped these, but on re-running, both the array rebuild & Verify Media come up with the same errors. I put the "old" disk on Channel B, expecting it to be dead; however, it appears in the BIOS, and when I run Verify Media it comes up with 1 bad block. Again, remapping fails to fix it, but after a low-level format Verify Media finds no errors.

So the disk I originally thought was dead now appears to be as good as new (so too, presumably, is the replacement). However, the *real* problem seems to lie with disk ID 3, even though the BIOS says it is optimal - and I can't rebuild the array. As the array is now running on only 3 disks out of 4, I am reluctant to do anything with ID 3 that might corrupt the array. Do I have any alternative other than to rebuild the array & reinstall everything? Fortunately I can still make backups of all partitions *except* C:, and I guess I can save anything I need from there on CD-R, but it is still a real pain...

Anyone have any ideas what is really going on, & how I can recover it? And how many, if any, dud disks do I have? The noises when the problem originally manifested definitely sounded very physical!

Thanks in advance for all help & advice.

Andrew

Anthony Preston
September 22nd 04, 10:47 AM
Are the disks hot?

I was getting SCSI disk errors because the 4 disks that I had configured were getting too hot and started to play up.

I have since added additional fans and now the disks are easy to touch and handle but before I added the fans I could hardly touch them.

Mal
September 22nd 04, 02:10 PM
when you say you have a Raid 10 array setup ... do you mean a raid 1/0? ...
the only raid setups I've come across are either Raid0, Raid1, Raid1/0 or
Raid5.

It sounds like you have a Raid 0/1 array from the drives you have and the
space available (Raid5 would give you 108GB -- 3x36GB + 4th drive online
spare) ... how is the mirroring set up? ... if the drives on SCSI IDs 5&6
are a mirror of drives 3&4 then you should be able to restore from them. If
that's the case you could try rebuilding the array using your original drive
4 and the replacement while keeping drive 3 intact.
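
To make the capacity arithmetic above concrete, here is a minimal Python sketch. The function name and the level labels are made up for illustration, and it ignores controller metadata and formatting overhead. It only encodes the textbook rules: striping uses every disk, striped mirrors use half of them, and RAID 5 gives up one disk's worth of space to parity.

```python
def usable_capacity_gb(level: str, disks: int, disk_gb: int) -> int:
    """Return the nominal usable capacity in GB (hypothetical helper).

    Ignores metadata and filesystem overhead; 'level' is an illustrative
    label, not something read from a real controller.
    """
    if level == "0":
        # Pure striping: every disk holds data.
        return disks * disk_gb
    if level in ("1+0", "0+1"):
        # Striping plus mirroring: half the disks hold copies.
        if disks % 2:
            raise ValueError("striped mirrors need an even number of disks")
        return disks // 2 * disk_gb
    if level == "5":
        # One disk's worth of space goes to parity.
        if disks < 3:
            raise ValueError("RAID 5 needs at least three disks")
        return (disks - 1) * disk_gb
    raise ValueError(f"unsupported level: {level}")


for level in ("0", "1+0", "5"):
    print(level, usable_capacity_gb(level, disks=4, disk_gb=36))
```

With four 36GB disks this gives 72GB for striped mirrors, which matches the space reported in the original post, and 108GB for RAID 5. On RAID 5 that holds with all four disks in the array; the parity is spread across them, so no separate spare is involved.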

let us know how you get on

Mal

Andrew Wasielewski
September 23rd 04, 12:09 AM
Disks don't feel noticeably warm, let alone hot. Case is a Lian-Li PC-71 with plenty of fans, so I don't think that can be it...
"Anthony Preston" > wrote in message ...
Are the disks hot?

I was getting SCSI disk errors because the 4 disks that I had configured were getting too hot and started to play up.

I have since added additional fans and now the disks are easy to touch and handle but before I added the fans I could hardly touch them.


Andrew Wasielewski
September 23rd 04, 12:21 AM
RAID 0+1 is another name for my setup, i.e. striping + mirroring. Since the
4 disks are identical I presume there is a 1-to-1 correspondence between
the blocks on one side of the mirror and the other, as the mirroring logic
doesn't care whether & how they are striped. However, in that case I don't
know how the disks are paired off. There wasn't anywhere to specify it in
the array setup in the SCSI BIOS, & I can't see anything that displays it.

I am wary about finding out the hard way by disconnecting disk ID 3, as if
that turns out to be the currently non-redundant disk I don't want to risk
corrupting the array irretrievably. Or will it simply fail to recognise the
array at all in that case, until I put back the missing disk?
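
The pairing question can at least be reasoned through on paper. The sketch below is only an illustration under stated assumptions: it models the array as a stripe across two mirrored pairs, and the function names are invented here. Nothing in it is read from the Adaptec BIOS, and the controller's real layout is unknown. It lists the three possible ways to pair IDs 3, 4, 5 and 6 and checks which of them would survive losing both ID 3 and ID 4.

```python
DISK_IDS = (3, 4, 5, 6)


def possible_pairings(ids):
    """Yield every way to split four disks into two mirrored pairs.

    Hypothetical helper: the real controller's pairing is not known.
    """
    first = ids[0]
    for partner in ids[1:]:
        rest = tuple(d for d in ids if d not in (first, partner))
        yield ((first, partner), rest)


def survives(pairing, failed):
    """Assumed model: the array lives while each pair keeps one good disk."""
    return all(any(d not in failed for d in pair) for pair in pairing)


for pairing in possible_pairings(DISK_IDS):
    ok = survives(pairing, failed={3, 4})
    print(pairing, "survives losing 3 and 4:", ok)
```

Under this model only the (3, 4) / (5, 6) pairing is fatal when both 3 and 4 are unreadable. A rebuild of ID 4 that stops on a read error from ID 3 would be consistent with 3 being 4's partner, but that is an inference, not something the BIOS has confirmed.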

"Mal" > wrote in message ...
> when you say you have a Raid 10 array setup ... do you mean a raid 1/0? ...
> the only raid setups I've come across are either Raid0, Raid1, Raid1/0 or
> Raid5.
>
> It sounds like you have a Raid 0/1 array from the drives you have and the
> space available (Raid5 would give you 108Gb -- 3x36Gb + 4th drive online
> spare) ... how is the mirroring setup? ... if the drives on scsi id's 5&6
> are a mirror of drives 3&4 then you should be able to restore from them. If
> that's the case you could try rebuilding the array using your original drive
> 4 and the replacement while keeping drive 3 intact
>
> let us know how you get on
>
> Mal

Tim
September 23rd 04, 01:31 AM
Hi,

I can't see the original post for this due to ISP clobbering the
newsgroup....

What type of raid controller is it?

- Tim

Tim Kelley
September 23rd 04, 01:51 AM
In article >, Andrew Wasielewski wrote:
> I am having some strange/worrying problems with my SCSI RAID setup.
> Perhaps someone can help?
>
> I have got a Gigabyte GA-KNXP Ultra m/b with embedded Adaptec AIC-7902
> U320 SCSI controller. I have got a RAID 10 array made up of 4 x Maxtor
> Atlas IV 10k. 36GB disks (SCSI IDs 3, 4, 5 & 6) on Channel A. The 72GB
> useable space is partitioned into a 4GB user partition + 4 others for
> paging file, apps & data (+ some free space left over). I run WinXP Pro
> SP1.

When you're having mystifying problems, try looking at heat and power ...

That's a lot of drives and if you have a lot of other stuff, perhaps
your power supply can't handle it? That can cause all manner of
weirdness.

Are they too hot? What does the temp sensor on the drives say?

--
_ _ _ _ _ _ _ _ _ _ _ _ _
/ \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \ / \
( t | i | m | @ | i | t | . | k | p | t | . | c | c )
\_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/ \_/
GPG key fingerprint = 1DEE CD9B 4808 F608 FBBF DC21 2807 D7D3 09CA 85BF

Folkert Rienstra
September 23rd 04, 10:48 PM
"Tim" > wrote in message
> Hi,
>
> I can't see the original post for this due to ISP clobbering the newsgroup....



Some ISPs filter HTML posts.

You can find the message-ID in the header.
Cut and paste it to your address bar and type "news:" in front of it.
Or go to the topmost message and click the attribution line.
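
As a rough sketch of the cut-and-paste step, assuming the Message-ID is copied from the header with or without its angle brackets: the function name and the example ID below are made up for illustration.

```python
def news_url(message_id: str) -> str:
    """Turn a Message-ID header value into a news: URL (hypothetical helper)."""
    cleaned = message_id.strip()
    # Headers usually wrap the ID in angle brackets; the URL form drops them.
    if cleaned.startswith("<") and cleaned.endswith(">"):
        cleaned = cleaned[1:-1]
    if "@" not in cleaned:
        raise ValueError("a Message-ID normally contains an '@'")
    return "news:" + cleaned


# Made-up ID, not taken from this thread.
print(news_url("<example-id@news.example.invalid>"))
```

Whether the address bar then opens the article depends on a newsreader being registered for news: links on that machine.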

>
> What type of raid controller is it?
>
> - Tim
>

> "Andrew Wasielewski" > wrote in message
> ...

[snip]