A computer components & hardware forum. HardwareBanter


Troubleshooting SANs



 
 
#1
March 10th 07, 06:20 PM, posted to comp.arch.storage
[email protected]
external usenet poster
Posts: 2

I work for a consulting firm, and have begun to do troubleshooting on
small SANs, mostly HP MSA1500cs based.

Often the problem the customer describes is some vague, intermittent
slowness. In cases like this, my troubleshooting goes something like
this:

1. Check the switch logs for marginal ports or other errors (usually
Brocade 4/24s or similar).
2. Update to the latest firmware and driver levels on the HBAs, switch,
MSA, etc.
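
A rough sketch of how step 1 could be made less eyeball-driven: take two snapshots of the per-port error counters a few minutes apart and flag whatever grew. The snapshot format (a dict of port number to counter values), the counter names, and the sample numbers are all assumptions for illustration; collecting the counters from the switch is not shown.

```python
from typing import Dict, List, Tuple

# Hypothetical snapshot format: {port_number: {counter_name: cumulative_value}}.
Snapshot = Dict[int, Dict[str, int]]


def ports_with_growing_errors(before: Snapshot, after: Snapshot,
                              threshold: int = 1) -> List[Tuple[int, str, int]]:
    """Return (port, counter, delta) for every counter that grew by >= threshold."""
    findings = []
    for port, counters in after.items():
        previous = before.get(port, {})
        for name, value in counters.items():
            delta = value - previous.get(name, 0)
            if delta >= threshold:
                findings.append((port, name, delta))
    # Largest increases first, so the most suspicious ports lead the list.
    findings.sort(key=lambda item: item[2], reverse=True)
    return findings


# Made-up counter values, as if sampled a few minutes apart.
before = {0: {"crc_err": 0, "enc_out": 12}, 1: {"crc_err": 3, "enc_out": 40}}
after = {0: {"crc_err": 0, "enc_out": 12}, 1: {"crc_err": 9, "enc_out": 95}}

for port, counter, delta in ports_with_growing_errors(before, after):
    print(f"port {port}: {counter} grew by {delta}")
```

Comparing deltas rather than absolute values matters because the counters are cumulative; a port that logged errors months ago would otherwise look as bad as one that is failing now.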

If the problem still exists, I'll call HP support, but more often than
not they can't really help from here. So the only approach that
yields results is to start unplugging stuff until I see the problem
disappear.

In one recent instance, I had a customer start shutting blades off
until he found that one of them had an HBA that was mysteriously
causing the intermittent slowness for the whole SAN. The HBA actually
seemed to work, and there were no errors in the Windows event logs, the
switch logs, SANsurfer, or anywhere else.

There has got to be a better way to find this kind of thing. On an IP
network, I would run Ethereal or some other packet analyzer to try to
see what is talking on the network when the problem manifests. But
I've never really found anything like that for a Fibre Channel SAN.

As I said, I'm pretty new to SANs, so any direction would be helpful.

Thanks,
Sean

#2
March 12th 07, 02:07 PM, posted to comp.arch.storage
Piotr
external usenet poster
Posts: 4

Hi Sean,

check
http://www.finisar.com/index.php?fil...d%2 0Analysis

Good luck,
Piotr


#3
March 13th 07, 12:46 AM, posted to comp.arch.storage
Sean Howard[_2_]
external usenet poster
Posts: 1

Yeah, I found some of that stuff. The problem with everything I've found is
that it requires taps. I haven't found anything equivalent to a "mirroring
port" on a switch.

Does such a thing exist?

#4
March 13th 07, 11:50 AM, posted to comp.arch.storage
Piotr
external usenet poster
Posts: 4


Yes it does, but not on every product. As far as I am aware, you can find it
on Brocade 48000 directors and Brocade 5000 FC switches.

There is a good reason for using taps in SAN monitoring and troubleshooting
(see below, as found in a Finisar document covering this problem):
1. Mirroring multiple ports to one port causes buffer overflow and dropped
packets.
2. Packets go through a buffer and are retimed, making accurate
time-sensitive measurements such as jitter, packet gap analysis, or
latency impossible.
3. Most mirror ports filter anomalies, thus making troubleshooting
impossible.
4. Turning on port mirroring puts a load on the switch's CPU/transfer logic,
thus impacting the switch's operational performance.
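
To put rough numbers on point 1, here is a back-of-the-envelope sketch. It assumes constant traffic rates and a single shared buffer on the mirror port; the function name, the rates, and the buffer size are made up for illustration and do not describe any particular switch.

```python
def mirror_overflow_seconds(source_rates_gbps, mirror_rate_gbps, buffer_megabytes):
    """Seconds until the mirror port's buffer fills, or None if it never does.

    Assumes constant rates, which real traffic will not have.
    """
    offered_gbps = sum(source_rates_gbps)
    excess_gbps = offered_gbps - mirror_rate_gbps
    if excess_gbps <= 0:
        return None  # the mirror port can keep up with the offered load
    buffer_gigabits = buffer_megabytes * 8 / 1000
    return buffer_gigabits / excess_gbps


# Hypothetical case: three ports carrying 2 Gbit/s each are mirrored to
# one 4 Gbit/s port that has 1 MB of buffering.
seconds = mirror_overflow_seconds([2.0, 2.0, 2.0], 4.0, 1.0)
print(f"buffer full after about {seconds * 1000:.1f} ms")
```

With these made-up figures the buffer fills in a few milliseconds, after which frames are dropped from the capture.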

Piotr


#5
March 16th 07, 09:43 PM, posted to comp.arch.storage
[email protected]
external usenet poster
Posts: 1

Since when do 48Ks (or *any* Brocade switch) support port mirroring?
One would think the Condors could handle it, but I've never seen it
implemented in Brocade's product line. I'm not sure about Cisco.

To the OP, the only way I know of is tapping the fabric. There are FC
protocol analyzers, but they sit in-band.

-Mark


#6
March 19th 07, 02:54 AM, posted to comp.arch.storage
Moojit
external usenet poster
Posts: 18


You're correct. There is no such thing as port mirroring or a Fibre Channel
software analyzer comparable to Ethereal on Ethernet. Short of using an
inline Fibre Channel analyzer (Finisar is the de facto standard), your best
bet in this scenario is to use an application such as SCSI Utility For
Windows to monitor the HBA port statistics and determine what errors may be
happening.

The Moojit


#7
March 24th 07, 11:50 PM, posted to comp.arch.storage
Bob S
external usenet poster
Posts: 1

Sean,

I'm going to guess that this wasn't an FC problem. I'm more inclined to believe it was a SCSI problem. Specifically,
I would guess that the blade you shut down was doing Target Resets.

If an initiator sends a target reset to a target and this target is providing LUNs for multiple initiators, all the
outstanding IOs to all the initiators get reset. The initiators time out and retry the IO which succeeds. The end
result is all the initiators slow down but no errors are displayed. Zoning won't help.

You can limit the possible suspects by seeing which initiators are slowing down and which target they have in common.
The HP box might provide some higher debug level that exposes target resets so you can track them down.
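
As a minimal sketch of that narrowing-down step: given which targets each initiator can reach, intersect the target sets of the initiators that slowed down, then list every initiator that shares the common target. The host and target names and the mapping itself are invented for illustration; in practice they would come from your own zoning or LUN presentation records.

```python
def common_targets(initiator_targets, slow_initiators):
    """Targets that every slow initiator has in common."""
    target_sets = [set(initiator_targets[name]) for name in slow_initiators]
    if not target_sets:
        return set()
    return set.intersection(*target_sets)


def initiators_using(initiator_targets, targets):
    """Every initiator that can reach at least one of the given targets."""
    return {name for name, reachable in initiator_targets.items()
            if targets & set(reachable)}


# Hypothetical mapping of initiators (hosts) to the targets they use.
initiator_targets = {
    "blade1": {"msa_a"},
    "blade2": {"msa_a", "msa_b"},
    "blade3": {"msa_b"},
    "blade4": {"msa_a"},
}
slow = ["blade1", "blade2"]

shared = common_targets(initiator_targets, slow)
suspects = initiators_using(initiator_targets, shared)
print("shared target(s):", sorted(shared))
print("initiators that could be resetting it:", sorted(suspects))
```

In this made-up example blade3 drops off the list because it never touches the shared target, while blade4 stays on it even though nobody reported it as slow.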

In my experience, the most likely culprit is a Windows 2003 SP1 cluster node (probably with an older storport driver).
Whenever you see this problem, I suggest just upgrading all the Windows clusters and all the storport drivers.

Follow http://support.microsoft.com/default.aspx?scid=kb;EN-US;923830

MSCS uses resets to decide quorum ownership, and when the nodes get in a pickle, they do too many resets. Too many resets show
up as slow storage. Cluster nodes do log resets in the cluster log, although they don't call them resets; look for
/arbitrat/ as in arbitration or something like that.
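
A minimal sketch of that search, assuming the log is plain text with one entry per line. The sample lines below are placeholders, not real cluster log output, and the pattern is just the /arbitrat/ fragment mentioned above, matched case-insensitively.

```python
import re

# The fragment suggested above; it matches "arbitrate", "arbitration", etc.
ARBITRATION = re.compile(r"arbitrat", re.IGNORECASE)


def arbitration_lines(lines):
    """Return the lines that mention arbitration, with line endings stripped."""
    return [line.rstrip("\n") for line in lines if ARBITRATION.search(line)]


# Placeholder lines standing in for a real log file.
sample_log = [
    "placeholder entry: disk arbitration started\n",
    "placeholder entry: resource brought online\n",
    "placeholder entry: Arbitrate for quorum did not complete\n",
]

hits = arbitration_lines(sample_log)
print(f"{len(hits)} arbitration-related line(s)")
for line in hits:
    print(line)
```

To run it against a real file, pass an open file object instead of the sample list; a sudden burst of matches around the time of the slowdown would be the thing to look for.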

There is also the Emulex TPRLO command, which is an FC issue. You can research TPRLOs. If the offending blade had
Emulex cards, see if TPRLO was enabled. (By default it shouldn't be, and if it is you'll get the same problems.)








#8
April 2nd 07, 10:07 AM, posted to comp.arch.storage
[email protected]
external usenet poster
Posts: 2


I work as a SAN consultant for HP and I agree that embedding taps into
environments is a very good idea. I have three Finisar analysers, and
one of the biggest problems is getting the change approved to add or
remove them; getting the customer to install taps removes this
obstacle. The Cisco platform does have the SD port (mirror...)
functionality, but you don't see the whole picture when using it. The
last time I was involved with an escalation on MDS, Cisco themselves
asked for a Finisar trace.

Kind Regards

Jason



 





