#1
scsi parity errors in a SAN ?
I have several Seagate disks here which complain about SCSI parity errors:

0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 byte2 0xb info 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 fru 0x3 sks 0x8

I was wondering how, when using FCP (SCSI over FC), these errors can occur? Fibre Channel itself has a parity check AND has tons of illegal characters (i.e. illegal 10-bit combinations), so it's unlikely for any combination of bit flips to go undetected at the FC level.

My question basically is: how can an FC frame's payload have a parity error, but the frame itself have a valid one? I assumed any FC device, including the disks, would throw frames with CRC errors out in class 3 (which FCP uses), causing a timeout and retry at the FC-4 level. Then how can the disk ever report SCSI parity errors?

Or am I interpreting the error message incorrectly, and does the disk's FC adapter report CRC problems noticed at the FC level as SCSI parity problems?

Thanks a heap in advance.

Arne
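For readers who want to check the interpretation of the log line above, here is a minimal Python sketch that decodes it. It assumes that `err` is the sense data response code and that `byte2` is byte 2 of fixed-format sense data (low nibble = sense key); that mapping is a guess about this driver's log format, not something the post confirms. The lookup tables list only the values that occur in the line, with their standard SCSI meanings.

```python
import re

# Only the values that occur in the log line are listed; the meanings are
# the standard SCSI assignments for those codes.
OPCODES = {0x2A: "WRITE(10)"}
STATUSES = {0x02: "CHECK CONDITION"}
RESPONSE_CODES = {0x70: "current error, fixed-format sense data"}
SENSE_KEYS = {0x0B: "ABORTED COMMAND"}
ASC_ASCQ = {(0x47, 0x00): "SCSI PARITY ERROR"}

LOG_LINE = ("0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 "
            "byte2 0xb info 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 "
            "fru 0x3 sks 0x8")


def parse_fields(line):
    """Return {name: value} for every 'name 0x...' pair in the line."""
    pairs = re.findall(r"(\w+) (0x[0-9a-fA-F]+)", line)
    return {name: int(value, 16) for name, value in pairs}


def decode(line):
    """Translate the numeric fields into readable names.

    Assumption: 'err' is the response code and 'byte2' is sense byte 2.
    """
    f = parse_fields(line)
    return {
        "command": OPCODES.get(f["cmd"], "unknown"),
        "status": STATUSES.get(f["status"], "unknown"),
        "response_code": RESPONSE_CODES.get(f["err"], "unknown"),
        "sense_key": SENSE_KEYS.get(f["byte2"] & 0x0F, "unknown"),
        "additional_sense": ASC_ASCQ.get((f["asc"], f["ascq"]), "unknown"),
    }


decoded = decode(LOG_LINE)
for name, meaning in decoded.items():
    print(f"{name}: {meaning}")
```

Under those assumptions the line reads as a WRITE(10) that ended in CHECK CONDITION, with sense key ABORTED COMMAND and additional sense SCSI PARITY ERROR.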
#2
Good question.
I'm not the expert, but... It sounds like you have a good knowledge of FCP and SCSI-3, so I recommend you go to http://www.t10.org/ftp/t10/drafts/fcp3/fcp3r01.pdf to find the answer. This doc should have everything needed to answer your questions. You can also look at http://www.t10.org/scsi-3.htm to view all the docs they have on the subject.

Hope it helped

Robert
#3
> 0x20000004cf246cf8 lun 0x0 cmd 0x2a status 0x2 err 0x70 seg 0x0 byte2 0xb
> info 0xc37000 alen 10 csi 0x0 asc 0x47 ascq 0x0 fru 0x3 sks 0x8
>
> I was wondering how, when using FCP (scsi over FC), these errors can occur ?

Yes, of course SCSI parity errors can occur. Remember, SCSI is still the embedded protocol. Many disk systems have internal native SCSI disks. Often these disks are connected to a control processor board which does the SCSI-to-FC conversion as well. FC is used for the host connection.

My assumption is that you might have internal SCSI problems (physical problems) which are reported in the usual way: check condition, request sense, asc 0x47, ascq 0x00.

On the FC protocol we do have other ways of error detection. You mentioned CRC, and that's right. Those errors have nothing to do with any error reporting of the embedded protocol.

René
#4
Rene Köhnen-Wiesemes wrote:
>> I was wondering how, when using FCP (scsi over FC), these errors can occur ?
> Yes, of course SCSI parity errors can occur. Remember, SCSI is still the embedded protocol. Many disk systems have internal native SCSI disks. Often these disks are connected to a control processor board which does the SCSI-to-FC conversion as well. FC is used for the host connection. My assumption is that you might have internal SCSI problems (physical problems) which are reported in the usual way: check condition, request sense, asc 0x47, ascq 0x00. On the FC protocol we do have other ways of error detection. You mentioned CRC, and that's right. Those errors have nothing to do with any error reporting of the embedded protocol.

Good point, internal disk problems could cause SCSI parity errors. But here's what brought me to my question: several disks in several enclosures reported SCSI parity errors, but when we changed the *initiator's* switch port, the SCSI parity errors disappeared. So that made me rule out problems with the disks themselves. We're trying to decide if there's something wrong with that switch port or not.

The setup is as follows:

initiator --- edge switch --- enclosure

My reasoning was:
1. If the initiator, its SFP, the cable, the switch's SFP or the switch port introduced CRC errors, I would think the switch would throw away the frame.
2. If the switch internally corrupts the frame, I guess it would send it out to the targets anyway and it would get there corrupted.
3. Since we changed nothing else but the initiator's switch port to make the problem go away, we can rule out the outgoing switch port, SFP, cable and enclosure's SFP, as well as anything in the enclosure or disks.

Now if the FC disk reports CRC errors in the FC frame as SCSI parity errors, we have some proof that the frame got corrupted, and from the reasoning above, it must mean the switch somehow did it internally. If the FC disk does not do this, the SCSI parity errors are unexplained, since we didn't do anything to the disks to make the error disappear.

Today I remembered hearing that modern FC switches (this is a SanBox2) get the frame's header out the door while the frame's tail is still coming in, in order to reach the desired switching speed and, I guess, in order not to need tons of memory to buffer frames. If that is true, I can only assume they calculate the CRC on the fly, and start reading the FC frame's header for the destination address before they have had a chance to verify the CRC of the entire frame. So what if they realize at the tail of the frame that the CRC is incorrect?

So I'm starting to doubt my theory that the CRC error cannot be introduced somewhere between the initiator and the switch because the switch would have thrown the frame out...

Ah, the woes of trouble-shooting Fibre Channel :)

Arne
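The cut-through worry in the last paragraphs can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: `zlib.crc32` stands in for the Fibre Channel frame CRC, the header and payload bytes are made up, and `cut_through_forward` is a hypothetical helper, not how the SanBox2 or any real switch is implemented. The point it shows is that a device which forwards each chunk as it arrives only learns the CRC verdict after the whole frame is already on its way.

```python
import zlib


def cut_through_forward(chunks, expected_crc):
    """Forward chunks as they arrive and keep a running CRC.

    Returns (bytes already forwarded, whether the CRC matched at the tail).
    zlib.crc32 is a stand-in for the real frame CRC (assumption).
    """
    forwarded = bytearray()
    running_crc = 0
    for chunk in chunks:
        forwarded += chunk                       # already out the door
        running_crc = zlib.crc32(chunk, running_crc)
    return bytes(forwarded), running_crc == expected_crc


# Hypothetical frame contents, for illustration only.
header = b"HEADER--"
payload = b"payload bytes of a hypothetical frame"
good_crc = zlib.crc32(header + payload)

# Flip one payload bit "on the link" before the frame reaches the switch.
corrupted = bytearray(payload)
corrupted[3] ^= 0x01

sent, crc_ok = cut_through_forward([header, bytes(corrupted)], good_crc)
print(len(sent), crc_ok)
```

In this sketch the corrupted frame is forwarded in full and `crc_ok` comes back false, so the only choices left at the tail are to count the error or to flag the frame, not to withhold it.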
#5
> My reasoning was :
> 1. If the initiator, its SFP, the cable, the switch's SFP or the switch port introduced CRC errors, I would think the switch would throw away the frame.

No, the switch should not throw away any frame; it only counts errors such as CRC errors, frame too long, frame too short ...

> 2. If the switch internally corrupts the frame, I guess it would send it out to the targets anyway and it would get there corrupted.

Yes, if the switch corrupts the frame it will send it to the target, that's for sure. Switches are more or less passive components as far as the frame is concerned. They convert signals (optical/electrical/optical), buffer signals, check frames and so on, but they create no data (such as CRC information). They look into the frame header to decide where the frame has to go.

> 3. Since we changed nothing else but the initiator's switch port to make the problem go away, we can rule out the outgoing switch port, SFP, cable and enclosure's SFP, as well as anything in the enclosure or disks. Now if the FC disk reports CRC errors in the FC frame as scsi parity errors, we have some proof that the frame got corrupted, and from the reasoning above, it must mean the switch somehow did it internally.

Wait, did you ever check whether there are unusually high CRC error counters inside the switch? "Unusual" means that the number of CRC errors is above approx. 1% of the Tx/Rx rate.

By the way, IMHO there is only one way to make sure that the error happens inside the switch: compare the incoming and outgoing data (frames) with a Fibre Channel protocol analyzer.

> Today I remembered hearing that modern FC switches (this is a SanBox2) get the frame's header out the door by the time the frame's tail is still coming in, in order to get to the desired switching speed and I guess in order not to need tons of memory to buffer frames. If that is true, I can only assume they calculate the CRC on the fly, and start reading the FC frame's header for destination address before they had a chance to verify the CRC checksum of the entire frame. So what if they at the tail of the frame realize the CRC is incorrect ?

As I said above, the frame's data is generated only inside the initiator; there is no CRC calculation on the fly. If the switch detects a corrupted frame, it counts it and sends it anyway.

> Ah, the woes of trouble-shooting Fibre Channel :)

Yes, some things are still a miracle ;-)

René
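The analyzer comparison suggested above boils down to a simple decision, sketched here in Python. Everything in it is illustrative: `locate_corruption` is a hypothetical helper, each capture is assumed to be a `(frame_bytes, crc)` pair taken on the ingress and egress side of the switch, and `zlib.crc32` again stands in for the real frame CRC.

```python
import zlib


def frame_crc_ok(frame, crc):
    """True if the stand-in CRC of the frame bytes matches the captured CRC."""
    return zlib.crc32(frame) == crc


def locate_corruption(ingress, egress):
    """Classify where a frame was damaged from two analyzer captures.

    ingress, egress: (frame_bytes, crc) as seen entering and leaving the switch.
    """
    in_frame, in_crc = ingress
    out_frame, out_crc = egress
    if not frame_crc_ok(in_frame, in_crc):
        return "corrupted before the switch"
    if out_frame != in_frame or not frame_crc_ok(out_frame, out_crc):
        return "corrupted inside the switch"
    return "clean through the switch"


# Hypothetical capture of one frame, for illustration only.
frame = b"hypothetical FCP write data"
crc = zlib.crc32(frame)
damaged = b"hypothetical FCP write dat\x00"

print(locate_corruption((frame, crc), (frame, crc)))
print(locate_corruption((frame, crc), (damaged, crc)))
print(locate_corruption((damaged, crc), (damaged, crc)))
```

A frame that arrives with a matching CRC and leaves with different bytes points at the switch itself; a frame that already fails the CRC on ingress points upstream, at the initiator, its SFP or the cable.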
#6
Rene Köhnen-Wiesemes wrote:
>> 2. If the switch internally corrupts the frame, I guess it would send it out to the targets anyway and it would get there corrupted.
> Yes, if the switch corrupts the frame it will send it to the target, that's for sure. Switches are more or less passive components as far as the frame is concerned. They convert signals (optical/electrical/optical), buffer signals, check frames and so on, but they create no data (such as CRC information). They look into the frame header to decide where the frame has to go.
>
> Wait, did you ever check whether there are unusually high CRC error counters inside the switch? "Unusual" means that the number of CRC errors is above approx. 1% of the Tx/Rx rate.

Yes we did, and the CRC counters stayed at zero while the disks were reporting parity problems. With that in mind, I guess we have more proof that the switch itself corrupted the frame, because when it received the frames it saw no problems, yet the disks see them.

> By the way, IMHO there is only one way to make sure that the error happens inside the switch: compare the incoming and outgoing data (frames) with a Fibre Channel protocol analyzer.

Indeed, that will be our next step. I've already hooked a Bit Error Rate Tester up to the suspicious switch port, but that didn't give an unusually high BER number: 1e-14, which is well within FC specs. With the analyser I'll also be able to see if the entire frame's CRC matches the contents, and if only the SCSI data's parity gets screwed up.

I guess there still is a chance that bit flips in the frame's payload are not detected by the Fibre Channel CRC, although it seems weird that we can consistently reproduce this...

>> Today I remembered hearing that modern FC switches (this is a SanBox2) get the frame's header out the door by the time the frame's tail is still coming in, in order to get to the desired switching speed and I guess in order not to need tons of memory to buffer frames. If that is true, I can only assume they calculate the CRC on the fly, and start reading the FC frame's header for destination address before they had a chance to verify the CRC checksum of the entire frame. So what if they at the tail of the frame realize the CRC is incorrect ?
> As I said above, the frame's data is generated only inside the initiator; there is no CRC calculation on the fly. If the switch detects a corrupted frame, it counts it and sends it anyway.

You're right, I was thinking of the switch being able (or allowed) to throw frames out in class 3, but they only do that if they run out of resources (no more buffer credits on the outgoing port, for example).

Thanks a lot for your help Rene!

Arne Joris
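A rough back-of-the-envelope calculation in Python shows why random bit flips slipping past the CRC are hard to square with a consistently reproducible problem. The numbers are assumptions apart from the measured BER: a 2.125 Gbit/s line rate is assumed for the link, and 2^-32 is the usual approximation for the fraction of randomly corrupted frames that a 32-bit CRC fails to catch.

```python
BER = 1e-14                  # measured on the suspicious port (from the post)
LINE_RATE_BPS = 2.125e9      # assumed line rate, not stated in the thread
SECONDS_PER_DAY = 86_400

# Expected raw bit errors per day on a fully loaded link.
bit_errors_per_day = BER * LINE_RATE_BPS * SECONDS_PER_DAY

# Approximate chance that a randomly corrupted frame still passes a 32-bit CRC.
undetected_fraction = 2.0 ** -32

# Treat each bit error as hitting a separate frame (pessimistic simplification).
undetected_per_day = bit_errors_per_day * undetected_fraction
days_per_undetected_error = 1.0 / undetected_per_day

print(f"bit errors per day:         {bit_errors_per_day:.2f}")
print(f"undetected frames per day:  {undetected_per_day:.2e}")
print(f"days per undetected frame:  {days_per_undetected_error:.2e}")
```

Under these assumptions the link sees roughly two bit errors per day and an undetected corrupted frame only once in billions of days, so errors that show up on demand are far more likely to be introduced somewhere the CRC does not cover than to be random line errors that escaped it.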