scsi parity errors in a SAN ?

#1 October 16th 03, 07:14 PM

I have several Seagate disks here which complain about scsi parity errors :

0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 byte2 0xb info
0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 fru 0x3 sks 0x8
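
For reference, the fields in that line appear to follow the layout of fixed-format SCSI sense data (response code, segment, byte 2 with the sense key, information, additional length, command-specific information, ASC, ASCQ, FRU, sense-key specific). The sketch below is only an illustration under that assumption: the field names are taken from the log line itself, the lookup tables hold just the handful of standard codes needed here, and the leading 64-bit value is assumed to be the reporting device's world wide name.

```python
# Minimal decoder for the driver log line quoted above (illustrative only).
LINE = ("0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 "
        "byte2 0xb info 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 "
        "fru 0x3 sks 0x8")

# Deliberately tiny tables: only a few standard SCSI codes are listed.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}
STATUS = {0x00: "GOOD", 0x02: "CHECK CONDITION"}
SENSE_KEYS = {0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0xB: "ABORTED COMMAND"}
ASC_ASCQ = {(0x47, 0x00): "SCSI PARITY ERROR"}


def parse_fields(line):
    """Split the line into the leading id plus 'name value' pairs."""
    tokens = line.split()
    fields = {"id": int(tokens[0], 0)}  # assumed to be a WWN
    for name, value in zip(tokens[1::2], tokens[2::2]):
        fields[name] = int(value, 0)    # base 0 accepts both 0x.. and decimal
    return fields


def decode(line):
    """Translate the numeric codes into their standard names."""
    f = parse_fields(line)
    sense_key = f["byte2"] & 0x0F       # sense key is the low nibble of byte 2
    return {
        "command": OPCODES.get(f["cmd"], "unknown"),
        "status": STATUS.get(f["status"], "unknown"),
        "sense_key": SENSE_KEYS.get(sense_key, "unknown"),
        "additional_sense": ASC_ASCQ.get((f["asc"], f["ascq"]), "unknown"),
    }


for key, value in decode(LINE).items():
    print(f"{key}: {value}")
```

Read this way, the line says that a WRITE(10) ended with CHECK CONDITION, sense key ABORTED COMMAND, and additional sense 0x47/0x00, which is the code assigned to "SCSI parity error".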

I was wondering how, when using FCP (scsi over FC), these errors can occur ?

Fibre Channel itself has a CRC check AND has tons of illegal characters (i.e.
illegal 10-bit combinations), so it's unlikely for any combination of bit-flips
to go undetected at the FC level. My question basically is : how can an FC
frame's payload have a parity error, but the frame itself have a valid CRC ?

I assumed any FC device, including the disks, would throw frames with CRC errors
out in class 3 (which FCP uses), causing a timeout and retry at the FC-4 level.
Then how can the disk ever report scsi parity errors ? Or am I interpreting the
error message incorrectly, and does the disk's FC adapter report CRC problems
noticed at the FC level as scsi parity problems ?

Thanks a heap in advance.

Arne

#2 October 17th 03, 12:51 AM

Good question.
I'm no expert, but...
It sounds like you have a good knowledge about FCP and SCSI-3, so I
will recommend you go here
http://www.t10.org/ftp/t10/drafts/fcp3/fcp3r01.pdf
to retrieve the answer. This doc should have everything to answer your
questions. However, you can always look at
http://www.t10.org/scsi-3.htm to view all the docs they have on the
subject.

Hope it helped

Robert

#3 October 17th 03, 08:43 AM

> 0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 byte2 0xb info
> 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 fru 0x3 sks 0x8
>
> I was wondering how, when using FCP (scsi over FC), these errors can occur ?

Yes, of course SCSI parity errors can occur. Remember, SCSI is still the
embedded protocol. Many disk systems have internal native SCSI disks.
Often these disks are connected to a control processor board which does the
SCSI-to-FC conversion as well. FC is used for the host connection. My assumption
is that you might have internal SCSI problems (physical problems) which are
reported in the usual way: check condition, request sense, asc 0x47, ascq
0x00.
Well, in the FC protocol we do have other ways of error detection; you
mentioned CRC, that's right. These errors have nothing to do with any error
reporting of the embedded protocol.

René

#4 October 17th 03, 03:03 PM

Rene Köhnen-Wiesemes wrote:
> > I was wondering how, when using FCP (scsi over FC), these errors can occur ?
>
> Yes, of course SCSI parity errors can occur. Remember, SCSI is still the
> embedded protocol. Many disk systems have internal native SCSI disks.
> Often these disks are connected to a control processor board which does the
> SCSI-to-FC conversion as well. FC is used for the host connection. My assumption
> is that you might have internal SCSI problems (physical problems) which are
> reported in the usual way: check condition, request sense, asc 0x47, ascq
> 0x00.
> Well, in the FC protocol we do have other ways of error detection; you
> mentioned CRC, that's right. These errors have nothing to do with any error
> reporting of the embedded protocol.

Good point, internal disk problems could cause scsi parity errors.
But here's what brought me to my question : several disks in several enclosures
reported scsi parity errors, but when we changed the *initiator's* switch port,
the scsi parity errors disappeared. So that made me rule out problems with the
disks themselves. We're trying to decide if there's something wrong with that
switch port or not.
The setup is as follows :

initiator --- edge switch --- enclosure

My reasoning was :

1. If the initiator, its SFP, cable, the switch's SFP or the switch port
introduced CRC errors, I would think the switch would throw away the frame.
2. If the switch internally corrupts the frame, I guess it would send it out to
the targets anyway and it would get there corrupted.
3. Since we changed nothing else but the initiator's switch port to make the
problem go away, we can rule out the outgoing switch port, SFP, cable and
enclosure's SFP, as well as anything in the enclosure or disks.

Now if the FC disk reports CRC errors in the FC frame as scsi parity errors, we
have some proof that the frame got corrupted, and from the reasoning above, it
must mean the switch somehow did it internally. If the FC disk does not do this,
the scsi parity errors are unexplained, since we didn't do anything to the disks
to make the error disappear.

Today I remembered hearing that modern FC switches (this is a SanBox2) get the
frame's header out the door by the time the frame's tail is still coming in, in
order to get to the desired switching speed and I guess in order not to need
tons of memory to buffer frames. If that is true, I can only assume they
calculate the CRC on the fly, and start reading the FC frame's header for
destination address before they had a chance to verify the CRC checksum of the
entire frame. So what if they only realize at the tail of the frame that the CRC
is incorrect ? So I'm starting to doubt my theory that a CRC error cannot be
introduced somewhere between the initiator and the switch because the switch
would have thrown the frame out...
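
The difference between the two forwarding styles described above can be shown with a toy model. This is only a sketch: the frame layout (payload followed by a 4-byte CRC-32), the function names and the chunk size are all made up for illustration, and nothing here claims to describe what a SANbox2 or any other real switch does.

```python
import zlib


def make_frame(payload):
    """Append a CRC-32 trailer to the payload (toy frame, not real FC framing)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(frame):
    """Recompute the CRC over the payload and compare it with the trailer."""
    payload, trailer = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == trailer


def store_and_forward(frame):
    """Buffer the whole frame first; a bad frame never leaves the switch."""
    return frame if crc_ok(frame) else b""


def cut_through(frame, chunk=4):
    """Forward each chunk as it arrives; the CRC verdict comes after the fact."""
    sent = bytearray()
    for i in range(0, len(frame), chunk):
        sent += frame[i:i + chunk]       # this part is already on the wire
    return bytes(sent), crc_ok(frame)    # too late to stop it, only to flag it


good = make_frame(b"SCSI write data")
bad = bytes([good[0] ^ 0x01]) + good[1:]  # one bit flipped on the way in

print(store_and_forward(bad))            # nothing is forwarded
print(cut_through(bad))                  # corrupted bytes forwarded, verdict False
```

In the cut-through case the corrupted bytes reach the next hop regardless of the verdict, so whether the target ever sees them depends on what the switch does once the tail has arrived.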

Ah, the woes of trouble-shooting Fibre Channel :)

Arne

#5 October 17th 03, 04:00 PM

> My reasoning was :
>
> 1. If the initiator, its SFP, cable, the switch's SFP or the switch port
> introduced CRC errors, I would think the switch would throw away the frame.

No, the switch should not throw away any frame; it only counts
errors like CRC, frame too long, frame too short ...

> 2. If the switch internally corrupts the frame, I guess it would send it out to
> the targets anyway and it would get there corrupted.

Yes, if the switch corrupts the frame it will send it to the
target, that's for sure. Switches are kind of passive components
with respect to the frame. Well, they convert signals (optical/electrical/optical),
buffer signals, check frames and so on, but they create no data (like
CRC information). They look into the frame header to decide where
it has to go.

> 3. Since we changed nothing else but the initiator's switch port to make the
> problem go away, we can rule out the outgoing switch port, SFP, cable and
> enclosure's SFP, as well as anything in the enclosure or disks.
>
> Now if the FC disk reports CRC errors in the FC frame as scsi parity errors, we
> have some proof that the frame got corrupted, and from the reasoning above, it
> must mean the switch somehow did it internally.

Wait, did you ever check whether there are unusually high CRC error
counters inside the switch? "Unusual" means that the number of
CRC errors is above approx. 1% of the Tx/Rx rate.

By the way, IMHO there is only one way to make sure that the
error happens inside the switch: compare the incoming and outgoing
data (frames) with a Fibre Channel protocol analyzer.

> Today I remembered hearing that modern FC switches (this is a SanBox2) get the
> frame's header out the door by the time the frame's tail is still coming in, in
> order to get to the desired switching speed and I guess in order not to need
> tons of memory to buffer frames. If that is true, I can only assume they
> calculate the CRC on the fly, and start reading the FC frame's header for
> destination address before they had a chance to verify the CRC checksum of the
> entire frame. So what if they at the tail of the frame realize the CRC is
> incorrect ?

As I said above, the frame's data is generated only inside
the initiator; there is no CRC calculation on the fly.
If the switch detects a corrupted frame, it counts it and
sends it anyway.

> Ah, the woes of trouble-shooting Fibre Channel :)

Yes, something is still a miracle ;-)

René

#6 October 18th 03, 07:56 PM

Rene Köhnen-Wiesemes wrote:
> > 2. If the switch internally corrupts the frame, I guess it
> > would send it out to the targets anyway and it would get there corrupted.
>
> Yes, if the switch corrupts the frame it will send it to the
> target, that's for sure. Switches are kind of passive components
> with respect to the frame. Well, they convert signals (optical/electrical/optical),
> buffer signals, check frames and so on, but they create no data (like
> CRC information). They look into the frame header to decide where
> it has to go.
>
> Wait, did you ever check whether there are unusually high CRC error
> counters inside the switch? "Unusual" means that the number of
> CRC errors is above approx. 1% of the Tx/Rx rate.

Yes we did, and the CRC counters stayed zero while the disks were
reporting parity problems. With that in mind, I guess we have more proof
that the switch itself corrupted the frame, because when it received the
frames, it saw no problems, yet the disks see them.

> By the way, IMHO there is only one way to make sure that the
> error happens inside the switch: compare the incoming and outgoing
> data (frames) with a Fibre Channel protocol analyzer.

Indeed, that will be our next step. I've already hooked a Bit Error Rate
Tester up to the suspicious switch port, but that didn't give an
unusually high BER number : 1e-14, which is well within FC specs.
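
To put that BER figure in perspective, it helps to convert it into a time between bit errors. The sketch below assumes a 2 Gb/s link signalling at 2.125 Gbaud and running flat out (the post does not state the link speed, so that number is an assumption), and compares the measured value with the 1e-12 limit that FC links have to meet.

```python
def seconds_per_bit_error(ber, line_rate_bps):
    """Mean time between bit errors for a link that is transmitting constantly."""
    return 1.0 / (ber * line_rate_bps)


LINE_RATE = 2.125e9  # assumption: 2 Gb/s FC signalling rate (2.125 Gbaud)
FC_LIMIT = 1e-12     # maximum bit error rate allowed for an FC link
MEASURED = 1e-14     # value reported by the tester in the post

for label, ber in (("FC limit", FC_LIMIT), ("measured", MEASURED)):
    secs = seconds_per_bit_error(ber, LINE_RATE)
    print(f"{label:9s} BER {ber:.0e}: one bit error every {secs / 3600:.2f} h")
```

Under those assumptions the limit corresponds to a bit error every few minutes and the measured value to one every half a day or so, which is far too rare to account for parity errors that can be reproduced at will.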

With the analyser I'll also be able to see if the entire frame's CRC
matches the contents, and if only the scsi data's parity gets screwed up.
I guess there still is a chance that bit flips in the frame's payload
are not detected by the Fibre Channel CRC checksum, although it seems
weird that we consistently can reproduce this...
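
How small that chance is can be checked experimentally. The sketch below uses Python's zlib.crc32 as a stand-in for the FC frame CRC: as far as I know it is built on the same 32-bit generator polynomial, but the bit ordering on the wire differs, so treat it as an illustration of the detection properties rather than a bit-exact FC checksum. The 2112-byte payload is the maximum FC data field size; the payload contents and helper names are made up.

```python
import random
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def undetected_single_bit_flips(payload):
    """Count single-bit corruptions that leave the CRC-32 unchanged."""
    good = zlib.crc32(payload)
    return sum(
        1
        for i in range(len(payload) * 8)
        if zlib.crc32(flip_bit(payload, i)) == good
    )


def undetected_bursts(payload, trials, seed=0):
    """Count random error bursts of up to 32 bits that leave the CRC-32 unchanged."""
    rng = random.Random(seed)
    good = zlib.crc32(payload)
    missed = 0
    for _ in range(trials):
        offset = rng.randrange(len(payload) - 3)
        pattern = rng.randrange(1, 1 << 32).to_bytes(4, "big")  # never all zero
        corrupted = bytearray(payload)
        for k in range(4):
            corrupted[offset + k] ^= pattern[k]
        if zlib.crc32(bytes(corrupted)) == good:
            missed += 1
    return missed


payload = bytes(random.Random(1).randrange(256) for _ in range(2112))
print(undetected_single_bit_flips(payload), undetected_bursts(payload, 1000))
```

Both counts come out as zero, because a 32-bit CRC always catches a single flipped bit and any error burst no longer than 32 bits. Corruption that gets past the frame CRC consistently therefore points at something other than random bit flips on the link.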

> > Today I remembered hearing that modern FC switches (this is a
> > SanBox2) get the frame's header out the door by the time the frame's tail is
> > still coming in, in order to get to the desired switching speed and I guess in
> > order not to need tons of memory to buffer frames. If that is true, I can only
> > assume they calculate the CRC on the fly, and start reading the FC
> > frame's header for destination address before they had a chance to verify the
> > CRC checksum of the entire frame. So what if they at the tail of the frame
> > realize the CRC is incorrect ?
>
> As I said above, the frame's data is generated only inside
> the initiator; there is no CRC calculation on the fly.
> If the switch detects a corrupted frame, it counts it and
> sends it anyway.

You're right, I was thinking of the switch being able (or allowed) to
throw frames out in class 3, but switches only do that if they run out of
resources (no more buffer credits on the outgoing port, for example).

Thanks a lot for your help Rene !

Arne Joris
