#1
Write delays on writes to EMC Symmetrix
Hi
We have an app that writes to a 700 MB memory-mapped file on an EMC Sym from a Sun V480 with JNI HBAs. The Sym has 16 GB of cache. On occasion the app's users complain that the app freezes.

I wrote a Perl script to simulate the app and measure write performance. My script re-writes a 500 MB file, 500 bytes at a time. The script runs flat out, writing 500 MB in approx 70 secs - no problem. However, I have noticed that individual writes are sometimes very slow. I measure the time for each write and store the times in an array; after the writes complete, I print out the write times. Typical output:

0.000037 0.000038 0.000037 0.000037 1.234568 0.000967 0.000532 0.000214 0.000065 0.000037

There seems to be some correlation between SAN activity and these delayed writes: they are very noticeable during the nightly backups, but were totally absent during the weekend. Typically I get one or two slow writes taking 0.5 secs in every million writes. There is no regularity to the delays. We do see fsflush-induced delays, but those are typically around 0.1 secs and are quite regular.

We were originally running ufs with logging and changed to vxfs; this improved things from an average of 20 slow writes to an average of two. We also installed EMC PowerPath - this had no effect. We thought originally that this might be extent-allocation related; that's why I re-write an existing file rather than creating a new one.

The phenomenon seems to affect a large number of hosts attached to different Syms. It is also present on hosts with Emulex HBAs, and is also observed from Egenera Linux blades. All the Sun hosts have SRDF, so that was another suspect, but the Linux blades have no SRDF - there goes another theory. The huge Sym cache would seem to rule out disk thermal recals, so I am now puzzled.

Any ideas?

Pedro
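For anyone who wants to reproduce the test, here is a minimal Perl sketch of the kind of timing loop pedro describes, using Time::HiRes for microsecond resolution. The file path and the 0.5 s outlier threshold are illustrative assumptions; the test file must already exist, and syswrite is used here so each timing brackets a single write(2) call (whether pedro's own script used buffered or unbuffered writes is not stated).

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $file  = '/san/testfile';              # hypothetical path on the Sym volume
my $bs    = 500;                          # bytes per write, as in the test
my $count = 500 * 1024 * 1024 / $bs;      # ~1,048,576 writes to cover 500 MB
my $buf   = 'x' x $bs;
my @times;

# '+<' re-writes an existing file in place, matching pedro's method of
# avoiding extent allocation on each run.
open my $fh, '+<', $file or die "open $file: $!";
for (1 .. $count) {
    my $t0 = [gettimeofday];
    syswrite $fh, $buf or die "write: $!";
    push @times, tv_interval($t0);
}
close $fh;

# Print only the outliers rather than a million samples.
printf "%d: %.6f\n", $_, $times[$_] for grep { $times[$_] > 0.5 } 0 .. $#times;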
#2
In article , "pedro d" writes:
> We have an app that writes to a 700 MB memory mapped file on an EMC Sym
> from a Sun V480 with JNI HBAs. The Sym has 16 GB of cache. [...]
> Typically I get one or two slow writes taking 0.5 secs in every million
> writes. [...] The huge Sym cache would seem to rule out disk temp
> recals, so I am now puzzled. Any ideas?

I've no PermaCache experience, but the PermaCache file has to be tied to physical storage. Regardless of what it is tied to, the writes are destaged to physical disk. Find out from your Storage Administrator what it is mapped to. Go into WLA, metrics, Disks and look at that disk:

write commands per sec
KBytes written per sec
seeks per sec
average hypers per seek
average KBytes per write

You are on the right track. The hyper your small 700 MByte file is mapped to may be red-hot due to write traffic; but also, since that hyper is just one slice of a physical disk, reads and writes to the other slices of that disk may indeed be impeding the writes flushed from PermaCache to your hyper. What will clue us in is how much seeking the disk is doing, especially average hypers per seek correlated against your slow times. It is a matter of correlation.

The second thing that may be at issue: you don't mention whether the writes are sequential. What type of write traffic is it, and how well is the Sym able to combine the writes? Writing 500 MBytes in 70 seconds at 500 bytes per write is somewhere around 14,000 writes per second. The Sym surely has a threshold at which it de-stages; it could just be that it can't write any faster when it goes to write. That is why KBytes per sec, write commands per sec and KBytes per write are listed above: you can see whether it is a matter of saturation during the "hangs" (a very good possibility). That will jump out at you when you view the graphs.

Now the question is: can you map that PermaCache file to a large Meta? That way you would be de-staging your writes to many disks instead of just one hyper. Yes, the built-in assumption here is that you are mapped to a hyper.

Rob
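For what it's worth, Rob's throughput arithmetic can be sanity-checked in a few lines. The figures are taken from pedro's description (500 MB in roughly 70 seconds, 500 bytes per write); the 64K combined-write size is Rob's assumption.

#!/usr/bin/perl
use strict;
use warnings;

my $bytes  = 500 * 1024 * 1024;   # 500 MB test file
my $secs   = 70;                  # observed wall-clock time
my $iosize = 500;                 # bytes per host write

printf "host writes/sec : %.0f\n", $bytes / $iosize / $secs;   # ~15,000
printf "throughput MB/s : %.1f\n", $bytes / $secs / 1e6;       # ~7.5
printf "64K writes/sec  : %.0f\n", $bytes / $secs / 65536;     # ~114 if fully combined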
#3
Thanks Rob
The disk partition is in fact a slice on a 4-disk Meta, and I am doing sequential writes to the file.

The application is in fact a trading app that writes to several files: a memory-mapped database file sitting behind a Sybase Open Server instance, a log file, and a transaction-number mmap file. There are some less important files too. The database file is created daily and is approx 700 MBytes. The tx-number file is just a few bytes. The log file grows to about 700 MB in one day's trading. Each trade is entered in the database - and hence the mmap - and written to the log. Ideally I would like to move the logs to a separate Meta, but the app currently does not allow this.

I am pretty sure that the Meta itself is not write-bound, and stats from the WLA do not show any hot spots. The problem is that WLA is not very granular, and does not seem to be capable of showing stats for very short time periods (I may be wrong here).

I noticed that if I increase my individual writes from 500 bytes to 1 KB, the test time increases only slightly, which suggests to me that I am far from saturating the cache; after all, I am still only writing something like 10 MB per sec, and my data is being striped across 4 disks.

Any idea how de-staging works? Can it cause IO blocking during a de-stage? I would have expected the cache to be configured as a circular pair of FIFOs, so that writes go to one FIFO whilst the second is de-staged. I know that we have 16 GB of cache, but is this cache subdivided in any way, so that a particular Meta is only served with a small amount of cache?

Pedro
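A sketch of the block-size variation pedro mentions, for anyone repeating it: time a full rewrite at several write sizes. If the total time barely changes as the size doubles, per-write overhead rather than cache or spindle saturation dominates. The path and the list of sizes are assumed values, not from the thread.

#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $file = '/san/testfile';        # hypothetical pre-existing test file
my $size = 500 * 1024 * 1024;      # 500 MB rewrite

for my $bs (500, 1024, 4096) {
    my $buf = 'x' x $bs;
    open my $fh, '+<', $file or die "open: $!";
    my $t0 = [gettimeofday];
    for (1 .. int($size / $bs)) {
        syswrite $fh, $buf or die "write: $!";
    }
    close $fh;
    printf "bs=%-5d total=%.1fs\n", $bs, tv_interval($t0);
}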
#4
In article , "pedro d" writes:
> Thanks Rob. The disk partition is in fact a slice on a 4-disk Meta.

You mean either "4-way" (if so, 8 physical disks are involved) or "2-way" (if so, 4 physical disks are involved)?

> I am doing sequential writes to the file.

Okay. So in an ideal situation it is taking those 14,000-15,000 500-byte writes and making about 100 64K writes out of them. Or is it? Don't know.

> I am pretty sure that the Meta itself is not write-bound, and stats from
> the WLA do not show any hot spots. The problem is that WLA is not very
> granular [...]

It isn't just about hot spots, but whether you are IO-bound or have pending IO at -any- point (a cause of delays). That can be tricky to determine.

> I noticed that if I increase my individual writes from 500 bytes to 1 KB,
> the test time increases only slightly [...]

But go back and look at the Disks that are part of that Meta. If, for example, you are doing 100+ seeks/sec that cross hyper boundaries at the same time you are putting out heavy write traffic, you have a candidate. Another red flag: look at the hosts using storage on those 8 physical disks the 4-way Meta sits on. Do those hosts report long read queue lengths at the times your writes run longer? The problem is that the Sym doesn't report read queue depth; you have to reverse-engineer it.

The five things to look at from the last post:

write commands per sec
KBytes written per sec
seeks per sec
average hypers per seek
average KBytes per write

will help show whether there is an underlying issue. Write commands per sec to the Meta partitions will tell you how well it combined those 500-byte writes. Analysis and correlation will tell you just what is occurring; correlation is what I used to track down a bottleneck and write up a summary for internal consumption. Why are your writes running longer? The Sym hasn't ACKed them - busy destaging, maybe? Either way, you should be able to confirm whether or not the disks are the bottleneck as you track down the cause.

> Any idea how de-staging works? Can it cause IO blocking during a
> de-stage?

No idea. Good luck finding that out; let me know if you stumble upon technical details.

> I know that we have 16 GB of cache, but is this cache subdivided in any
> way, so that a particular Meta is only served with a small amount of
> cache?

This I have been "told" - take it for what it is worth. Each volume has a certain amount of cache associated with it as a "start". If the volume is busy (or whatever internal criteria they key on), the Sym will expand the cache associated with it, and can expand it again - three expansions or so, I've been informed. I was told that by an EMC rep; there may be something written about it somewhere, but I can't find it. I don't have figures, and what I poorly describe sounds like urban legend, but it is all I've got. If you ever stumble upon technical details, drop me a line.

Rob
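One practical way to do the correlation Rob describes is to timestamp each slow write as it happens, so the events can be lined up against WLA disk metrics for the same interval. Below is a sketch in the spirit of pedro's script; the log path, test file path, and 0.2 s threshold are illustrative assumptions, not details from the thread.

#!/usr/bin/perl
use strict;
use warnings;
use IO::Handle;
use POSIX qw(strftime);
use Time::HiRes qw(gettimeofday tv_interval);

open my $log, '>>', '/var/tmp/slow_writes.log' or die "log: $!";
$log->autoflush(1);                        # so entries land as they occur
open my $fh, '+<', '/san/testfile' or die "open: $!";

my $buf = 'x' x 500;
for (1 .. 1_048_576) {
    my $t0 = [gettimeofday];
    syswrite $fh, $buf or die "write: $!";
    my $dt = tv_interval($t0);
    # Log wall-clock time of each outlier for later matching against
    # WLA's per-interval disk stats.
    printf {$log} "%s %.6f\n", strftime('%H:%M:%S', localtime), $dt
        if $dt > 0.2;
}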