#1
Troubleshooting SANs
I work for a consulting firm and have begun troubleshooting small SANs, mostly HP MSA1500cs based. Often the problem the customer reports is a vague, intermittent slowness. In cases like this, my troubleshooting goes something like this:

1. Check the switch logs for marginal ports or other errors (usually Brocade 4/24s or similar).
2. Update to the latest firmware and driver levels on the HBAs, switch, MSA, etc.

If the problem still exists, I'll call HP support, but more often than not they can't really help from here. So the only approach that yields results is to start unplugging things until the problem disappears. In one recent instance, I had a customer start shutting blades off until he found that one of them had an HBA that was mysteriously causing the intermittent slowness for the whole SAN. The HBA actually seemed to work, and there were no errors in the Windows event logs, the switch logs, SANsurfer, or anywhere else.

There has got to be a better way to find this kind of thing. On an IP network, I would run Ethereal or some other packet analyzer to see what is talking on the network when the problem manifests, but I've never found anything like that for a Fibre Channel SAN. As I said, I'm pretty new to SANs, so any direction would be helpful.

Thanks,
Sean
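The unplug-until-it-stops approach described above can be shortened by halving the suspect set each round rather than removing hosts one at a time. A minimal Python sketch, assuming exactly one host is at fault and that a hypothetical `slow_with(hosts)` check (not something from the thread) can tell whether the slowness still appears with only those hosts attached:

```python
from typing import Callable, Sequence


def find_culprit(hosts: Sequence[str],
                 slow_with: Callable[[Sequence[str]], bool]) -> str:
    """Bisect the host list, assuming exactly one host causes the slowness.

    slow_with(subset) is a hypothetical, operator-supplied check: it should
    return True if the SAN is still slow with only `subset` left attached.
    """
    if not hosts:
        raise ValueError("need at least one suspect host")
    suspects = list(hosts)
    while len(suspects) > 1:
        half = suspects[: len(suspects) // 2]
        # Keep whichever half still reproduces the problem.
        suspects = half if slow_with(half) else suspects[len(half):]
    return suspects[0]


# Simulated example: blade5 is the bad one among eight blades.
blades = [f"blade{i}" for i in range(1, 9)]
checks = []


def simulated_check(subset):
    checks.append(list(subset))
    return "blade5" in subset


culprit = find_culprit(blades, simulated_check)
print(culprit, "found after", len(checks), "checks")
```

With eight blades this takes three checks rather than up to seven. For an intermittent fault, each check has to run long enough for the slowness to show up, so a "not slow" result should be treated with caution.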
#2
Hi Sean,

check http://www.finisar.com/index.php?fil...d%2 0Analysis

Good luck,
Piotr
#3
Yeah, I found some of that stuff. The problem with everything I've found is that it requires taps. I haven't found anything equivalent to a "mirroring port" on a switch. Does such a thing exist?
#4
In reply to Sean Howard:

Yes, it does, but not on every product. As far as I am aware, you can find it on Brocade 48000 directors and Brocade 5000 FC switches.

There are good reasons for using taps in SAN monitoring and troubleshooting (see below, as found in a Finisar document covering this problem):

1. Mirroring multiple ports to one port causes buffer overflow and dropped packets.
2. Packets go through a buffer and are retimed, making accurate time-sensitive measurements such as jitter, packet gap analysis, or latency impossible.
3. Most mirror ports filter anomalies, thus making troubleshooting impossible.
4. Turning on port mirroring puts a load on the switch's CPU/transfer logic, thus impacting the switch's operational performance.

Piotr
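Point 1 in the list above is plain arithmetic: a mirror port can only carry one port's worth of traffic. A rough Python sketch, using made-up traffic figures and treating link speed as a simple throughput number (both are assumptions, not measurements from the thread):

```python
def mirror_overload(source_rates_gbps, mirror_port_gbps):
    """Return (offered_load, excess) for traffic mirrored onto one port.

    excess > 0 means the mirror port cannot keep up, so frames must be
    buffered and eventually dropped.
    """
    offered = sum(source_rates_gbps)
    excess = max(0.0, offered - mirror_port_gbps)
    return offered, excess


# Hypothetical figures: three ports averaging 2.5, 3.0 and 1.5 Gbit/s
# mirrored onto a single 4 Gbit/s port.
offered, excess = mirror_overload([2.5, 3.0, 1.5], 4.0)
print(f"offered {offered} Gbit/s, excess {excess} Gbit/s")
```

Any sustained excess above zero has to be absorbed by the switch's buffers, which is where the overflow and dropped frames come from.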
#5
In reply to Piotr (Mar 13, 7:50 am):

Since when do 48k's (or *any* Brocade switch) support port mirroring? One would think the Condors could handle it, but I've never seen it implemented in Brocade's product line. I'm not sure about Cisco.

To the OP: the only way I know of is tapping the fabric. There are FC protocol analyzers, but they sit in band.

-Mark
#6
You're correct. There is no such thing as port mirroring, or a Fibre Channel software analyzer along the lines of Ethereal for Ethernet. Your best bet in this scenario, short of using an inline Fibre Channel analyzer (Finisar is the de facto standard), is to use an application such as SCSI Utility For Windows to monitor the HBA port statistics and determine what errors may be happening.

The Moojit
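Monitoring port statistics is mostly a matter of comparing snapshots over time, since a counter that is climbing matters more than one with a large but static value. A minimal Python sketch, assuming the statistics have already been collected into dictionaries by some other tool; the counter names here are illustrative and are not the output of any particular utility:

```python
def climbing_counters(before, after):
    """Return {counter: increase} for counters that grew between snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Illustrative snapshots taken a few minutes apart on one HBA port.
snap_1 = {"link_failures": 2, "loss_of_sync": 10, "invalid_crc": 0}
snap_2 = {"link_failures": 2, "loss_of_sync": 47, "invalid_crc": 3}

deltas = climbing_counters(snap_1, snap_2)
print(deltas)
```

Taking the snapshots just before and just after a slow period makes it easier to tie a climbing counter to the symptom.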
#7
Sean,

I'm going to guess that this wasn't an FC problem. I'm more inclined to believe it was a SCSI problem. Specifically, I would guess that the blade you shut down was doing Target Resets. If an initiator sends a target reset to a target, and this target is providing LUNs for multiple initiators, all the outstanding IOs for all the initiators get reset. The initiators time out and retry the IO, which succeeds. The end result is that all the initiators slow down but no errors are displayed. Zoning won't help.

You can limit the possible suspects by seeing which initiators are slowing down and which target they have in common. The HP box might provide some higher debug level that exposes target resets so you can track them down.

From my experience, the most likely culprit is a Windows 2003 SP1 cluster node (probably with an older storport driver). I suggest that whenever you see this problem you just upgrade all the Windows clusters and all the storport drivers. Follow http://support.microsoft.com/default.aspx?scid=kb;EN-US;923830

MSCS uses resets to decide quorum ownership, and when the nodes get in a pickle, they do too many resets. Too many resets show up as slow storage. Cluster nodes do log resets in the cluster log, although they don't call them resets; look for /arbitrat/, as in arbitration, or something like that.

There is also the Emulex TPRLO command, which is an FC issue. You can research TPRLOs. If the offending blade had Emulex cards, see if TPRLO was enabled. (By default it shouldn't be, and if it is you'll get the same problems.)
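Two of the steps above lend themselves to a quick script: narrowing down the target that the slow initiators share, and searching a cluster log for arbitration entries. A minimal Python sketch, assuming you have written down the initiator-to-target mapping yourself (the host and target names are made up) and that the log is available as plain text lines; the sample log lines are invented, and only the `arbitrat` search string comes from the post:

```python
import re


def common_targets(targets_by_initiator, slow_initiators):
    """Return the targets shared by every initiator that is slowing down."""
    sets = [set(targets_by_initiator[i]) for i in slow_initiators]
    return set.intersection(*sets) if sets else set()


def arbitration_lines(log_lines):
    """Return log lines mentioning arbitration (matches 'arbitrat', any case)."""
    pattern = re.compile(r"arbitrat", re.IGNORECASE)
    return [line for line in log_lines if pattern.search(line)]


# Made-up mapping of hosts to the storage targets presented to them.
mapping = {
    "hostA": ["msa1", "msa2"],
    "hostB": ["msa2"],
    "hostC": ["msa2", "msa3"],
    "hostD": ["msa3"],
}
shared = common_targets(mapping, ["hostA", "hostB", "hostC"])

# Invented log lines, purely to exercise the filter.
sample_log = [
    "disk online",
    "Arbitrate for ownership of the disk",
    "arbitration failed, retrying",
]
hits = arbitration_lines(sample_log)
print(shared, len(hits))
```

Any initiator with LUNs on the shared target is then a suspect for the resets, including ones that do not look slow themselves.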
#8
In reply to Bob S (25 Mar, 00:50):

I work as a SAN consultant for HP, and I agree that embedding taps into environments is a very good idea. I have three Finisar analysers, and one of the biggest problems is getting the change approved to add or remove them; getting the customer to install taps removes this obstacle.

The Cisco platform does have the SD port (mirror...) functionality, but you don't see the whole picture when using it. The last time I was involved with an escalation on MDS, Cisco themselves asked for a Finisar trace.

Kind Regards
Jason