A computer components & hardware forum. HardwareBanter


SCSI parity errors in a SAN?



 
 
  #1  
Old October 16th 03, 07:14 PM
Arne Joris

I have several Seagate disks here which complain about SCSI parity errors:

0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 byte2 0xb info
0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 fru 0x3 sks 0x8

I was wondering how, when using FCP (SCSI over FC), these errors can occur?

Fibre Channel itself has a frame-level CRC check AND has tons of illegal characters (i.e.
illegal 10-bit combinations), so it's unlikely for any combination of bit flips
to go undetected at the FC level. My question basically is: how can an FC
frame's payload have a parity error while the frame itself has a valid CRC?

I assumed any FC device, including the disks, would throw out frames with CRC errors
in class 3 (which FCP uses), causing a timeout and retry at the FC-4 level.
Then how can the disk ever report SCSI parity errors? Or am I interpreting the
error message incorrectly, and does the disk's FC adapter report CRC problems
noticed at the FC level as SCSI parity problems?

Thanks a heap in advance.

Arne
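
Editor's note: for readers decoding the log line in this post, here is a minimal sketch in Python. The field names are taken from the log itself. The lookup tables hold only the few SCSI codes that appear in this thread, and three things are assumptions on my part rather than anything the log format documents: that the first token identifies the reporting device, that `byte2` is byte 2 of fixed-format sense data (sense key in the low nibble), and that `alen 10` is decimal.

```python
# Lookup tables limited to the codes that appear in this thread;
# the full SCSI tables are far larger.
OPCODES = {0x2A: "WRITE(10)"}
STATUSES = {0x02: "CHECK CONDITION"}
RESPONSE_CODES = {0x70: "current error, fixed format"}
SENSE_KEYS = {0x0B: "ABORTED COMMAND"}
ADDITIONAL_SENSE = {(0x47, 0x00): "SCSI PARITY ERROR"}

LOG_LINE = (
    "0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 "
    "byte2 0xb info 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 "
    "fru 0x3 sks 0x8"
)


def parse_log_line(line):
    """Split the line into a leading device token plus 'name value' pairs."""
    tokens = line.split()
    fields = {"device": tokens[0]}  # assumption: device identifier
    for name, value in zip(tokens[1::2], tokens[2::2]):
        fields[name] = int(value, 0)  # base 0 accepts both 0x.. and decimal
    return fields


def decode(fields):
    """Translate the numeric fields into names, 'unknown' if not in a table."""
    return {
        "command": OPCODES.get(fields["cmd"], "unknown"),
        "status": STATUSES.get(fields["status"], "unknown"),
        "response_code": RESPONSE_CODES.get(fields["err"], "unknown"),
        # assumption: byte2 is sense byte 2, sense key in the low nibble
        "sense_key": SENSE_KEYS.get(fields["byte2"] & 0x0F, "unknown"),
        "additional_sense": ADDITIONAL_SENSE.get(
            (fields["asc"], fields["ascq"]), "unknown"
        ),
    }


decoded = decode(parse_log_line(LOG_LINE))
for name, meaning in decoded.items():
    print(f"{name}: {meaning}")
```

Read this way, the line reports a WRITE(10) that ended with CHECK CONDITION, sense key ABORTED COMMAND and additional sense SCSI PARITY ERROR.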

  #2  
Old October 17th 03, 12:51 AM
Robert

Good question.
I'm not the expert, but...
It sounds like you have a good knowledge of FCP and SCSI-3, so I
recommend you go here
http://www.t10.org/ftp/t10/drafts/fcp3/fcp3r01.pdf
to find the answer. This doc should have everything you need to answer your
questions. You can also look at
http://www.t10.org/scsi-3.htm to view all the docs they have on the
subject.

Hope it helped

Robert



  #3  
Old October 17th 03, 08:43 AM
Rene Köhnen-Wiesemes

> 0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 byte2 0xb info
> 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 fru 0x3 sks 0x8
>
> I was wondering how, when using FCP (SCSI over FC), these errors can occur?

Yes, of course SCSI parity errors can occur. Remember, SCSI is still the
embedded protocol. Many disk systems have internal native SCSI disks.
Often these disks are connected to a control processor board which does the
SCSI-to-FC conversion as well; FC is used for the host connection. My assumption
is that you might have internal SCSI problems (physical problems) which are
reported in the usual way: check condition, request sense, asc 0x47, ascq
0x00.
On the FC protocol we do have other ways of error detection. You
mentioned CRC, that's
right. Those errors have nothing to do with any error reporting of the
embedded protocol.

René
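
Editor's note: the distinction René draws can be illustrated with a small sketch. It assumes a deliberately simplified model: a frame is just a payload followed by a CRC-32 (Python's `zlib.crc32` stands in for the FC CRC, whose real bit ordering and framing differ), and the disk system's internal data path protects each byte with an odd parity bit. All function names are hypothetical.

```python
import zlib


def make_frame(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (stand-in for the FC frame CRC)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def frame_crc_ok(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer


def odd_parity_bit(byte: int) -> int:
    """Parity bit that makes the total number of 1 bits odd."""
    return (bin(byte).count("1") + 1) % 2


def flip_bit(data: bytes, index: int, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[index] ^= 1 << bit
    return bytes(out)


frame = make_frame(bytes(range(16)))

# Case 1: a bit flips on the FC link. The receiver's CRC check catches it.
link_error_detected = not frame_crc_ok(flip_bit(frame, 3, 0))

# Case 2: the frame arrives intact and passes the CRC check. The payload is
# then moved onto an internal parity-protected bus, where a bit flips.
assert frame_crc_ok(frame)
internal_bus = [(b, odd_parity_bit(b)) for b in frame[:-4]]
byte, parity = internal_bus[5]
internal_bus[5] = (byte ^ 0x04, parity)  # corruption after the CRC check
parity_errors = [
    i for i, (b, p) in enumerate(internal_bus) if odd_parity_bit(b) != p
]

print("link error detected by CRC:", link_error_detected)
print("internal bytes failing parity:", parity_errors)
```

In this model the two checks cover different stretches of the data path, so a parity error can be reported for data that arrived in a frame with a perfectly valid CRC.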


  #4  
Old October 17th 03, 03:03 PM
Arne Joris

Rene Köhnen-Wiesemes wrote:
>> I was wondering how, when using FCP (SCSI over FC), these errors can occur?
>
> Yes, of course SCSI parity errors can occur. Remember, SCSI is still the
> embedded protocol. Many disk systems have internal native SCSI disks.
> Often these disks are connected to a control processor board which does the
> SCSI-to-FC conversion as well; FC is used for the host connection. My assumption
> is that you might have internal SCSI problems (physical problems) which are
> reported in the usual way: check condition, request sense, asc 0x47, ascq
> 0x00.
> On the FC protocol we do have other ways of error detection. You
> mentioned CRC, that's
> right. Those errors have nothing to do with any error reporting of the
> embedded protocol.


Good point, internal disk problems could cause SCSI parity errors.
But here's what brought me to my question: several disks in several enclosures
reported SCSI parity errors, but when we changed the *initiator's* switch port,
the SCSI parity errors disappeared. So that made me rule out problems with the
disks themselves. We're trying to decide if there's something wrong with that
switch port or not.
The setup is as follows:

initiator --- edge switch --- enclosure

My reasoning was:

1. If the initiator, its SFP, cable, the switch's SFP or the switch port
introduced CRC errors, I would think the switch would throw away the frame.
2. If the switch internally corrupts the frame, I guess it would send it out to
the targets anyway and it would get there corrupted.
3. Since we changed nothing else but the initiator's switch port to make the
problem go away, we can rule out the outgoing switch port, SFP, cable and
enclosure's SFP, as well as anything in the enclosure or disks.

Now if the FC disk reports CRC errors in the FC frame as SCSI parity errors, we
have some proof that the frame got corrupted, and from the reasoning above, it
must mean the switch somehow did it internally. If the FC disk does not do this,
the SCSI parity errors are unexplained, since we didn't do anything to the disks
to make the error disappear.

Today I remembered hearing that modern FC switches (this is a SanBox2) get the
frame's header out the door while the frame's tail is still coming in, in
order to reach the desired switching speed and, I guess, to avoid needing
tons of memory to buffer frames. If that is true, I can only assume they
calculate the CRC on the fly, and start reading the FC frame's header for the
destination address before they have had a chance to verify the CRC of the
entire frame. So what if they realize at the tail of the frame that the CRC is
incorrect? So I'm starting to doubt my theory that a CRC error cannot be
introduced somewhere between the initiator and the switch because the switch
would have thrown the frame out...

Ah, the woes of trouble-shooting Fibre Channel :)


Arne
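
Editor's note: the cut-through behaviour described in this post can be sketched as a toy model. This is not a description of how the SANbox2 actually works. It assumes a frame is a byte string of payload plus a trailing CRC-32 (with `zlib.crc32` standing in for the FC CRC), and the hypothetical `cut_through` function forwards each chunk as soon as it arrives while updating a running CRC.

```python
import zlib


def cut_through(frame: bytes, chunk_size: int = 4):
    """Forward a frame chunk by chunk; return (forwarded_bytes, crc_ok)."""
    body, trailer = frame[:-4], frame[-4:]
    forwarded = bytearray()
    running_crc = 0
    for start in range(0, len(body), chunk_size):
        chunk = body[start:start + chunk_size]
        forwarded += chunk  # this chunk has already left the switch
        running_crc = zlib.crc32(chunk, running_crc)  # updated on the fly
    forwarded += trailer
    # The verdict is only available once the whole frame has gone out.
    crc_ok = running_crc.to_bytes(4, "big") == trailer
    return bytes(forwarded), crc_ok


payload = b"hello fc frame!!"
good_frame = payload + zlib.crc32(payload).to_bytes(4, "big")

bad_frame = bytearray(good_frame)
bad_frame[2] ^= 0x01  # one bit damaged before the frame reaches the switch
bad_frame = bytes(bad_frame)

print(cut_through(good_frame)[1])
print(cut_through(bad_frame)[1])
```

In this model the damaged frame is forwarded byte for byte, and the bad CRC is only noticed after the last chunk has gone out, which is the situation the post is asking about.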

  #5  
Old October 17th 03, 04:00 PM
Rene Köhnen-Wiesemes

> My reasoning was:
>
> 1. If the initiator, its SFP, cable, the switch's SFP or the switch port
> introduced CRC errors, I would think the switch would throw away the frame.

No, the switch should not throw away any frame; it only counts
errors like CRC, frame too long, frame too short ...

> 2. If the switch internally corrupts the frame, I guess it would send it out to
> the targets anyway and it would get there corrupted.

Yes, if the switch corrupts the frame it will send it to the
target, that's for sure. Switches are more or less passive components
with respect to the frame. They do convert signals (optical/electrical/optical),
buffer them, check frames and so on, but they create no data (such as
CRC information). They look into the frame header to decide where
the frame has to go.

> 3. Since we changed nothing else but the initiator's switch port to make the
> problem go away, we can rule out the outgoing switch port, SFP, cable and
> enclosure's SFP, as well as anything in the enclosure or disks.
>
> Now if the FC disk reports CRC errors in the FC frame as SCSI parity errors, we
> have some proof that the frame got corrupted, and from the reasoning above, it
> must mean the switch somehow did it internally.

Wait, did you ever check whether there are unusually high CRC error
counters inside the switch? "Unusual" means that the number of
CRC errors is above approx. 1% of the Tx/Rx rate.

By the way, IMHO there is only one way to make sure that the
error happens inside the switch: compare the incoming and outgoing
data (frames) with a Fibre Channel protocol analyzer.

> Today I remembered hearing that modern FC switches (this is a SanBox2) get the
> frame's header out the door by the time the frame's tail is still coming in, in
> order to get to the desired switching speed and I guess in order not to need
> tons of memory to buffer frames. If that is true, I can only assume they
> calculate the CRC on the fly, and start reading the FC frame's header for
> destination address before they had a chance to verify the CRC checksum of the
> entire frame. So what if they at the tail of the frame realize the CRC is
> incorrect?

As I said above, the frame's data is generated only inside
the initiator; there is no CRC calculation on the fly.
If the switch detects a corrupted frame, it counts it and
sends it anyway.

> Ah, the woes of trouble-shooting Fibre Channel :)

Yes, some things are still a mystery ;-)



René
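
Editor's note: the 1% rule of thumb from this post can be written down as a small check. This sketch assumes that "Tx/Rx rate" means the number of frames the port sent and received over the same interval as the CRC error counter. The function and parameter names are hypothetical and do not correspond to any particular switch's counter names.

```python
def crc_error_ratio(crc_errors: int, frames: int) -> float:
    """Fraction of frames that had a CRC error; 0.0 if no traffic was seen."""
    if frames == 0:
        return 0.0
    return crc_errors / frames


def is_unusual(crc_errors: int, frames: int, threshold: float = 0.01) -> bool:
    """Apply the 'above approx. 1%' rule of thumb from the post."""
    return crc_error_ratio(crc_errors, frames) > threshold


# Example counters (made-up numbers, not taken from the thread).
print(is_unusual(crc_errors=0, frames=1_000_000))
print(is_unusual(crc_errors=20_000, frames=1_000_000))
```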


  #6  
Old October 18th 03, 07:56 PM
Arne Joris

Rene Köhnen-Wiesemes wrote:
>> 2. If the switch internally corrupts the frame, I guess it
>> would send it out to the targets anyway and it would get there corrupted.
>
> Yes, if the switch corrupts the frame it will send it to the
> target, that's for sure. Switches are more or less passive components
> with respect to the frame. They do convert signals (optical/electrical/optical),
> buffer them, check frames and so on, but they create no data (such as
> CRC information). They look into the frame header to decide where
> the frame has to go.
>
> Wait, did you ever check whether there are unusually high CRC error
> counters inside the switch? "Unusual" means that the number of
> CRC errors is above approx. 1% of the Tx/Rx rate.

Yes we did, and the CRC counters stayed at zero while the disks were
reporting parity problems. With that in mind, I guess we have more proof
that the switch itself corrupted the frame, because when it received the
frames, it saw no problems, yet the disks see them.


> By the way, IMHO there is only one way to make sure that the
> error happens inside the switch: compare the incoming and outgoing
> data (frames) with a Fibre Channel protocol analyzer.

Indeed, that will be our next step. I've already hooked a Bit Error Rate
Tester up to the suspicious switch port, but that didn't give an
unusually high BER number: 1e-14, which is well within FC specs.
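
Editor's note: to put the measured figure in perspective, here is a small arithmetic sketch. The thread does not state the link speed, so the 2.125 Gbaud line rate of a 2 Gb/s FC link is an assumption, as is the use of 1e-12 as the commonly cited Fibre Channel BER requirement for comparison.

```python
def seconds_between_errors(ber: float, line_rate_baud: float) -> float:
    """Average time between single bit errors at a given bit error rate."""
    return 1.0 / (ber * line_rate_baud)


LINE_RATE_BAUD = 2.125e9  # assumption: 2 Gb/s FC link

measured = seconds_between_errors(1e-14, LINE_RATE_BAUD)
spec_limit = seconds_between_errors(1e-12, LINE_RATE_BAUD)

print(f"measured BER 1e-14: one bit error every {measured / 3600:.1f} hours")
print(f"BER 1e-12:          one bit error every {spec_limit / 60:.1f} minutes")
```

Under these assumptions a BER of 1e-14 works out to roughly one bit error every 13 hours of continuous traffic, which is hard to reconcile with parity errors that can be reproduced consistently.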

With the analyser I'll also be able to see whether the entire frame's CRC
matches the contents, or whether only the SCSI data's parity gets screwed up.
I guess there is still a chance that bit flips in the frame's payload
are not detected by the Fibre Channel CRC, although it seems
weird that we can consistently reproduce this...
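
Editor's note: the comparison of incoming and outgoing frames could look something like the sketch below. It assumes the analyser traces have already been exported as two lists of raw frames in matching order, each frame being payload plus a trailing CRC-32 (with `zlib.crc32` standing in for the FC CRC). The capture format and all names are hypothetical and not tied to any analyser product.

```python
import zlib


def crc_ok(frame: bytes) -> bool:
    """Check the trailing CRC-32 against the rest of the frame."""
    return zlib.crc32(frame[:-4]).to_bytes(4, "big") == frame[-4:]


def compare_captures(ingress, egress):
    """Report frames that differ between the switch's input and output."""
    findings = []
    for number, (before, after) in enumerate(zip(ingress, egress)):
        if before != after:
            offsets = [
                i for i, (x, y) in enumerate(zip(before, after)) if x != y
            ]
            findings.append({
                "frame": number,
                "offsets": offsets,
                "ingress_crc_ok": crc_ok(before),
                "egress_crc_ok": crc_ok(after),
            })
    return findings


def with_crc(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")


# Made-up traces: the second frame is damaged between input and output.
ingress = [with_crc(b"frame-zero"), with_crc(b"frame-one!")]
damaged = bytearray(ingress[1])
damaged[2] ^= 0x10
egress = [ingress[0], bytes(damaged)]

for finding in compare_captures(ingress, egress):
    print(finding)
```

A frame whose ingress copy passes the CRC check while its egress copy differs would point at the switch. If the egress copy differed and still carried a valid CRC, that would suggest the CRC was recomputed after the damage.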

>> Today I remembered hearing that modern FC switches (this is a
>> SanBox2) get the frame's header out the door by the time the frame's tail is
>> still coming in, in order to get to the desired switching speed and I guess in
>> order not to need tons of memory to buffer frames. If that is true, I can only
>> assume they calculate the CRC on the fly, and start reading the FC
>> frame's header for destination address before they had a chance to verify the
>> CRC checksum of the entire frame. So what if they at the tail of the frame
>> realize the CRC is incorrect?
>
> As I said above, the frame's data is generated only inside
> the initiator; there is no CRC calculation on the fly.
> If the switch detects a corrupted frame, it counts it and
> sends it anyway.

You're right, I was thinking of the switch being able (or allowed) to
throw frames out in class 3, but switches only do that if they run out of
resources (no more buffer credits on the outgoing port, for example).

Thanks a lot for your help Rene !

Arne Joris

 






All times are GMT +1.

