February 2nd 07, 04:59 PM, posted to comp.arch.storage
Raju Mahala
Posts: 47
WAFL writing across the raid groups in an aggregate

On Feb 2, 3:57 am, Faeandar wrote:
On 1 Feb 2007 05:25:49 -0800, "Raju Mahala" wrote:





I believe that if an aggregate has more than one raid group, then data
is written across all the raid groups in a horizontal fashion. So a
bigger aggregate should give better throughput, due to more spindles.


First, any comment on this: am I right or not?


When I check disk utilization through statit, I sometimes find that the
data-transfer commands issued per second are not roughly equal across
the disks of the raid groups in a single aggregate.


For example, see below:


disk ut% xfers ureads--chain-usecs writes--chain-usecs cpreads-chain-usecs greads--chain-usecs gwrites-chain-usecs
/aggr0/plex0/rg0:
0c.48  9  9.29  2.30 1.07 23280  6.82 4.29 2727  0.17 6.75 1548  0.00 .... .  0.00 .... .
0c.32  8  8.74  0.37 1.41 101484  8.21 3.90 2261  0.15 3.78 2706  0.00 .... .  0.00 .... .
/aggr1/plex0/rg0:
0c.17  56  48.68  1.47 1.00 8243  34.74 26.74 1022  12.47 10.61 1558  0.00 .... .  0.00 .... .
0c.49  55  49.92  1.45 1.00 23947  35.97 25.90 938  12.50 10.70 1465  0.00 .... .  0.00 .... .
0c.33  70  90.88  30.82 1.23 21822  40.71 17.17 1828  19.35 9.39 2273  0.00 .... .  0.00 .... .
0c.18  67  84.87  27.87 1.29 20410  38.13 18.40 1693  18.87 9.23 2238  0.00 .... .  0.00 .... .
0c.50  65  85.62  27.42 1.21 21775  38.85 17.42 1700  19.35 9.95 2001  0.00 .... .  0.00 .... .
0c.34  68  86.57  27.34 1.23 22603  39.86 17.34 1833  19.37 9.55 2194  0.00 .... .  0.00 .... .
0c.19  67  84.99  26.83 1.26 21149  39.74 17.68 1761  18.42 9.18 2228  0.00 .... .  0.00 .... .
0c.51  65  83.36  25.87 1.27 20110  39.08 17.81 1637  18.41 9.65 1977  0.00 .... .  0.00 .... .
0c.35  68  85.35  28.77 1.21 23676  38.13 18.46 1741  18.45 9.25 2320  0.00 .... .  0.00 .... .
0c.20  67  84.76  27.39 1.23 22127  38.27 17.88 1735  19.10 9.88 2048  0.00 .... .  0.00 .... .
0c.52  69  84.83  28.35 1.27 22185  37.83 18.25 1798  18.65 9.61 2230  0.00 .... .  0.00 .... .
0c.36  68  85.39  27.73 1.27 21596  38.73 17.91 1814  18.93 9.53 2192  0.00 .... .  0.00 .... .
0c.21  67  86.39  28.37 1.27 22485  38.63 17.56 1812  19.39 9.71 2123  0.00 .... .  0.00 .... .
0c.53  69  87.23  28.89 1.26 22340  39.12 17.78 1884  19.21 9.37 2252  0.00 .... .  0.00 .... .
0c.37  69  86.72  27.67 1.27 21195  39.73 17.72 1842  19.32 9.31 2217  0.00 .... .  0.00 .... .
0c.22  68  85.33  27.39 1.24 21374  38.76 18.08 1801  19.18 9.31 2144  0.00 .... .  0.00 .... .
/aggr1/plex0/rg1:
0c.38  58  54.53  0.00 .... .  37.39 27.59 974  17.14 9.69 1608  0.00 .... .  0.00 .... .
0c.54  59  54.79  0.00 .... .  37.65 27.41 1005  17.14 9.75 1650  0.00 .... .  0.00 .... .
0c.23  72  107.07  28.73 1.23 22749  52.13 14.50 1927  26.20 8.20 2296  0.00 .... .  0.00 .... .
0c.39  73  107.10  28.60 1.28 21650  51.87 14.85 1901  26.64 7.80 2418  0.00 .... .  0.00 .... .
0c.55  74  105.45  28.75 1.27 22783  50.68 15.00 1931  26.03 7.93 2471  0.00 .... .  0.00 .... .
0c.24  72  106.05  27.82 1.27 22016  52.02 14.79 1903  26.21 7.61 2392  0.00 .... .  0.00 .... .
0c.40  74  107.03  29.17 1.22 23488  52.14 14.77 1972  25.72 7.82 2526  0.00 .... .  0.00 .... .
0c.56  71  105.81  28.23 1.23 22033  51.59 14.88 1806  25.98 7.91 2191  0.00 .... .  0.00 .... .
0c.25  71  104.19  27.27 1.25 22330  51.15 15.05 1866  25.76 7.86 2252  0.00 .... .  0.00 .... .
0c.41  72  105.07  28.23 1.20 24299  51.23 14.87 1933  25.61 8.01 2369  0.00 .... .  0.00 .... .
0c.57  73  106.22  27.95 1.24 23069  51.88 14.76 1966  26.38 7.76 2409  0.00 .... .  0.00 .... .
0c.26  72  105.71  27.94 1.24 22384  51.79 14.99 1910  25.98 7.59 2376  0.00 .... .  0.00 .... .
0c.42  74  107.23  28.76 1.20 23742  51.98 14.83 1965  26.49 7.46 2531  0.00 .... .  0.00 .... .
0c.58  74  106.30  28.43 1.24 23027  51.76 14.98 1979  26.11 7.74 2459  0.00 .... .  0.00 .... .
0c.27  72  106.53  28.27 1.22 22733  52.02 14.66 1927  26.25 8.26 2184  0.00 .... .  0.00 .... .
0c.43  73  107.20  28.48 1.19 24864  51.43 14.63 1979  27.29 7.95 2325  0.00 .... .  0.00 .... .


Here rg0 has fewer data-transfer commands issued per second than rg1,
even though both rg0 and rg1 are part of the same aggregate, aggr1.


Could someone comment on why that is so?


Well, looking at the utilization of your parity drives, I would say
these are all reads. In that case the placement of the original data
is what matters most. If the data the reads are requesting is
primarily on one raid group, then that raid group is going to do more
work.
I'm no expert on statit, but to me it still looks like things are
balanced pretty evenly.
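
If you want to check that numerically instead of eyeballing it, something
like the quick script below will average the xfers column per raid group
from the statit output you pasted. This is just a rough sketch I'm making
up here, not a NetApp tool; the field positions assume the plain disk
lines exactly as shown above, so adjust for your ONTAP version.

#!/usr/bin/env python
# Rough sketch: average the xfers column per raid group from saved
# statit disk output.  Assumes a "/aggrN/plexN/rgN:" header line,
# then one line per disk whose third field is xfers.
import sys
from collections import defaultdict

xfers = defaultdict(list)   # raid group -> list of per-disk xfers
group = None

for line in sys.stdin:
    line = line.strip()
    if line.startswith('/') and line.endswith(':'):
        group = line.rstrip(':')            # e.g. /aggr1/plex0/rg0
        continue
    fields = line.split()
    if group and len(fields) >= 3:
        try:
            xfers[group].append(float(fields[2]))   # disk, ut%, xfers, ...
        except ValueError:
            pass                                    # skip header/other lines

for group in sorted(xfers):
    vals = xfers[group]
    print("%-20s disks=%2d  avg xfers/s=%6.1f" %
          (group, len(vals), sum(vals) / len(vals)))

Feed it the disk section of the statit output; on your numbers it works
out to roughly 80 xfers/s per disk in aggr1 rg0 versus roughly 100 in
rg1, so the skew is real but not huge.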

You are correct that written data gets striped across all raid groups
more or less evenly. I say "more or less" because I do not know what
the algorithm is for determining where a new write picks up if the last
write did not span one entire raid group, but generally writes are
striped across all raid groups in the aggregate.
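
Just to illustrate the idea (this is a made-up toy, definitely not
WAFL's real allocator), think of it roughly like this:

# Toy illustration only: new writes keep filling stripe slots across
# the raid groups of an aggregate, resuming wherever the previous write
# stopped instead of restarting at rg0 each time.
class ToyAggregate(object):
    def __init__(self, raid_groups):
        # raid_groups: list of (name, data_disk_count) tuples
        self.raid_groups = raid_groups
        self.rg_index = 0      # which raid group the next block goes to
        self.disk_index = 0    # which data disk within that raid group

    def allocate(self, nblocks):
        """Return a list of (raid_group, disk) placements for nblocks."""
        placements = []
        for _ in range(nblocks):
            name, ndisks = self.raid_groups[self.rg_index]
            placements.append((name, self.disk_index))
            self.disk_index += 1
            if self.disk_index == ndisks:   # filled this rg's slots
                self.disk_index = 0
                self.rg_index = (self.rg_index + 1) % len(self.raid_groups)
        return placements

# Disk counts here are just for illustration (e.g. 14 data disks per rg).
aggr1 = ToyAggregate([("rg0", 14), ("rg1", 14)])
print(aggr1.allocate(20))   # spills from rg0 into rg1
print(aggr1.allocate(10))   # next write resumes where the last one stopped

The point is just that a write which stops partway through a raid group
leaves the "cursor" there, and the next write continues from that spot
rather than starting over at rg0.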

There is a diminishing return between aggregate performance and drive
count. I believe the max was around 50 drives; after that you get no
added performance benefit from another drive added to the aggregate.
That test was done by NetApp using their write algorithms and striping
methods, so it may not hold true for other vendors.

~F


Thanks Faeandar, that is a nice, detailed comment which gave me a new
direction for debugging and configuration.
Can you suggest any commands for debugging slow performance and
back-to-back CPs? I normally use sysstat, "qtree stats", and statit.
Thanks once again.