Scaling storage performance and capacity horizontally?

#1
December 21st 06, 01:57 AM, posted to comp.arch.storage
[email protected]

Hi

I have a question about storage management, or disk load-balancing,
or.. I'm not sure.

I am having trouble working out how to scale storage horizontally.

We have a 16x320GB SATA-II RAID-5 array on our internet download
server, which serves a mix of big and small files (e.g. ISO images,
software patches, etc.).

We often comfortably serve 1000-2000 concurrent clients. However, this
is only when the majority of downloads are small files (<= 1MB).
Whenever we release a new ISO image, or even something just 100MB, the
server starts to struggle... even with only half our usual number of
downloaders.

I am guessing this is because the slower clients are unable to take
advantage of reading the files sequentially, and as such create more of
a random IO situation, and we are maxing out IOPS?

I can alleviate the problem right now by just buying another RAID-5
array and PC and doing DNS round robin, but then I am maintaining two
copies of our data and it just doesn't really feel like the optimal
solution.

I am wondering how people usually scale their storage, for both
capacity and IOPS, whilst keeping it easy to manage. I guess this
requires either some kind of fancy management software, or perhaps
using something like iSCSI with commodity hardware?

I am imagining a horizontally scalable storage architecture to which I
can simply add/remove hardware as our client-base changes in size. It
has one global namespace where I place the file once and then the
storage management system works out how to distribute it across
spindles/controllers to achieve the best performance -- if a file is
being requested far more than any other, I guess it would be replicated
more.

Does something like this exist, or perhaps I am going about it the
wrong way?

Any advice would be greatly appreciated!

#2
December 21st 06, 03:11 AM, posted to comp.arch.storage
Faeandar

On 20 Dec 2006 17:57:51 -0800, wrote:

[snip]

Look at Isilon, Panasas, Exanet, Ibrix, etc. Or you can even start
small and add a lot more RAM to your existing server.

From what you describe, and I may be mis-reading, you run into a
problem when you have clients asking for the same ISO (or something
similar). To me this means you're blowing cache with the requests, or
your server is not storage bound but rather system bound. If the data
is all in cache on the server, and it's still slow, you have a system
bottleneck. CPU and network would be the frontrunners.

16 drives is a decent stripe width, and everything you mention is
reads, so I doubt the disks are your bottleneck. In fact, performance
should increase with larger files, since they generally mean more
sequential IO.

~F
#3
December 21st 06, 03:33 AM, posted to comp.arch.storage
Bill Todd

wrote:

....

We have a 16x320GB SATA-II RAID-5 array on our internet download
server, which offers a mix of big and small files, (e.g. ISO images,
software patches, etc).

We often comfortably serve 1000-2000 concurrent clients. However, this
is only when the majority of downloads are small files (<= 1MB).
Whenever we release a new ISO image, or even something just 100MB, the
server starts to struggle.. even with only half our usual amount of
downloaders.

I am guessing this is because the slower clients are unable to take
advantage of reading the files sequentially, and as such create more of
a random IO situation, and we are maxing out IOPS?


Unlikely, since the hot file is probably getting loaded into the array
RAM (or the individual on-disk caches) and being served from there,
regardless of how fast the individual clients can inhale it. Bandwidth
(at the array interface or in the network) is the far more likely
bottleneck.

....

I am wondering how people usually scale their storage, for both
capacity and IOPS, whilst keeping it easy to manage. I guess this
requires either some kind of fancy management software,


Not with contemporary array technology - but most arrays aren't exactly
state of the art.

or perhaps
using something like iSCSI with commodity hardware?

I am imagining a horizontally scalable storage architecture to which I
can simply add/remove hardware as our client-base changes in size. It
has one global namespace where I place the file once and then the
storage management system works out how to distribute it across
spindles/controllers to achieve the best performance -- if a file is
being requested far more than any other, I guess it would be replicated
more.

Does something like this exist


Yes, and you don't even have to place your faith in some relatively
small file-system start-up to obtain it - because it's available at the
array level from products like HP's EVA line (which stripes your data
across as many drives as you want it to and adjusts for increases and
decreases in that number - IIRC Xiotech's Magnitude line does something
similar).

However, as noted above it doesn't sound as if you need anything like
this to solve your current problem: the combined bandwidth of 32 SATA
drives serving up a large file (even if not cached) should approach (or
possibly exceed) 1 GB per second, and with the file cached (either in
array RAM or in the on-disk caches) that figure should rise to 5 - 10
GB/sec (limited by the SATA I or II bus bandwidth), so unless your array
ports and network can handle this kind of bandwidth your problem likely
lies elsewhere.

- bill
#4
December 21st 06, 05:55 AM, posted to comp.arch.storage
[email protected]

Bill Todd wrote:
wrote:

...

We often comfortably serve 1000-2000 concurrent clients. However, this
is only when the majority of downloads are small files (<= 1MB).
Whenever we release a new ISO image, or even something just 100MB, the
server starts to struggle.. even with only half our usual amount of
downloaders.

I am guessing this is because the slower clients are unable to take
advantage of reading the files sequentially, and as such create more of
a random IO situation, and we are maxing out IOPS?


Unlikely, since the hot file is probably getting loaded into the array
RAM (or the individual on-disk caches) and being served from there,
regardless of how fast the individual clients can inhale it. Bandwidth
(at the array interface or in the network) is the far more likely
bottleneck.


Ok this makes perfect sense. But I know our Gigabit ethernet can push
harder, and from what I can tell the SAS array interface should
outperform that.

When the server starts to struggle, I do see 100% utilization in
iostat(1) output for the array though, with numbers something like
this:

r/s        500
rsec/s     280000
rkB/s      140000
avgrq-sz   500
avgqu-sz   10-20
await      100-200ms
svctm      2-4ms
%util      100

During this time I can still dd(1) a cold file to /dev/null from the
array at around 100MiB/s, so I'm not entirely sure what to make of
this. That is why I was thinking it had something to do with the
slower clients.
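
For reference, this is roughly how I have been gathering those numbers
(the ISO path is just an example from our tree):

  # extended per-device stats in KB, refreshed every 5 seconds
  iostat -x -k 5

  # sequential read of a file that should not already be cached
  # (on 2.6.16+ kernels the page cache can be dropped first)
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/data/isos/old-release.iso of=/dev/null bs=1M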

Is it possible that the problem is just system/controller RAM? And
that the burst of new webserver connections for the popular file are
taking cache memory away from the (many many) other less popular files
still being requested, thus resulting in extra traffic to the disk?

Is the 500 r/s quoted above going to be near the upper limit of 14x
SATA-II 7200RPM NCQ disks?

...

However, as noted above it doesn't sound as if you need anything like
this to solve your current problem:


This is the impression I am getting. I feel like I'm missing something
painfully obvious, but I really don't know what it is.

#5
December 21st 06, 02:51 PM, posted to comp.arch.storage
Bill Todd

wrote:
Bill Todd wrote:


I obviously was less than fully awake when I wrote my first response,
since I quoted bandwidth figures for 32 disks rather than 16 (so halve
those numbers accordingly).

wrote:

...

We often comfortably serve 1000-2000 concurrent clients. However, this
is only when the majority of downloads are small files (<= 1MB).


How *much* less than 1 MB in size could be significant. IIRC average
file sizes for Web server accesses are a few KB - which means that the
set-up overheads may significantly exceed the actual transfer overheads.
But if your server is primarily dedicated to user-initiated file
downloads, then the average size may be closer to the average size of
the target files (and I don't know exactly how much folderol is involved
in setting up the transfer and wrapping up the data in HTTP form).

Whenever we release a new ISO image, or even something just 100MB, the
server starts to struggle.. even with only half our usual amount of
downloaders.


Half the usual number of downloaders operating at several times the
average download efficiency (because they're streaming data rather than
performing smaller separate requests) could easily constitute a far
higher load on the array (and its associated links).


I am guessing this is because the slower clients are unable to take
advantage of reading the files sequentially, and as such create more of
a random IO situation, and we are maxing out IOPS?

Unlikely, since the hot file is probably getting loaded into the array
RAM (or the individual on-disk caches) and being served from there,
regardless of how fast the individual clients can inhale it. Bandwidth
(at the array interface or in the network) is the far more likely
bottleneck.


Ok this makes perfect sense. But I know our Gigabit ethernet can push
harder,


Harder than what? A single GigE link can at best transfer 110 - 115
MB/sec using TCP/IP - likely considerably less if a lot of transfers are
small ones and/or jumbo frames aren't in use.
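
Roughly, assuming standard 1500-byte frames:

  1 Gbit/s                         = 125 MB/s raw
  TCP payload per Ethernet frame   = 1460 of 1538 bytes on the wire (~95%)
  125 MB/s * 0.95                  ~ 118 MB/s theoretical ceiling
  minus ACKs, interrupts, etc.     ~ 110 - 115 MB/s in practice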

and from what I can tell the SAS array interface should
outperform that.


A single SAS bus should certainly out-perform a single GigE link.


When the server starts to struggle, I do see 100% utilization in
iostat(1) output for the array though, with numbers something like
this:

r/s        500
rsec/s     280000
rkB/s      140000
avgrq-sz   500
avgqu-sz   10-20
await      100-200ms
svctm      2-4ms
%util      100


Light may finally be starting to dawn on me: this is a JBOD made into a
RAID-5 array by software, not a hardware-managed RAID-5 array?

If so, your system may be performing a lot more individual disk accesses
- enough to saturate its CPU (if indeed that's what the util entry above
indicates).
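
If it is Linux software RAID, a couple of quick checks would confirm it
(assuming md; adjust for whatever the box actually runs):

  # shows whether the array is an md device and how it's laid out
  cat /proc/mdstat

  # watch CPU while the server struggles; look for the md/RAID kernel
  # threads or the web server pegging a core
  top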


During this time I can still dd(1) a cold file to /dev/null from the
array at around 100MiB/s, so I'm not entirely sure what to make of
this.


Could these requests be executing at a higher priority than the
web-servicing requests?

That is why I was thinking it had something to do with the
slower clients.

Is it possible that the problem is just system/controller RAM? And
that the burst of new webserver connections for the popular file are
taking cache memory away from the (many many) other less popular files
still being requested, thus resulting in extra traffic to the disk?

Is the 500 r/s quoted above going to be near the upper limit of 14x
SATA-II 7200RPM NCQ disks?


Possibly (though the number should perhaps be 15 disks if you're using a
single spare - RAID-5 is better for reads than RAID-4 in that respect,
since the distributed parity lets every drive serve data). It's under
half the IOPS you should get for small requests, but not as bad for the
average request size that's being reported (given that you state other
activity is still occurring in parallel, so the drives can't just
stream the large files out but instead are seeking back and forth
between them and other requested data).
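
A rough back-of-envelope, assuming the usual 75 - 100 random IOPS per
7200 RPM SATA drive:

  14 drives * 75 - 100 IOPS            ~ 1050 - 1400 small random reads/s
  observed: 500 r/s * ~280 KB average  ~ 140 MB/s (matches your rkB/s)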

You say 'NCQ', but are you sure that the disks are being used that way?
If the queue lengths being reported are in a queue managed in the
server by software rather than at the disks themselves, NCQing won't be
any benefit. Not to mention the fairness issues of letting the disks
simply optimize throughput, regardless of how long it may make some
client wait.
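
On Linux with libata you can usually see what the kernel is actually
doing (the sysfs path and device name vary by driver and kernel version):

  # queue depth the kernel is using for this disk
  cat /sys/block/sda/device/queue_depth

  # whether the drive itself advertises NCQ
  hdparm -I /dev/sda | grep -i queue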

- bill
#6
December 21st 06, 07:41 PM, posted to comp.arch.storage
_firstname_@lr_dot_los-gatos_dot_ca.us

First off, I'm not much of an expert on smallish storage systems. I
would feel more comfortable if the question were about a PB class
system connected at many GBytes/s.

wrote:
I have a question about storage management, or disk load-balancing,
or.. I'm not sure.


I think the key lies in the "I'm not sure". Do you know what your
bottleneck really is? And even if you know your one smallest
bottleneck, do you know what other system components are next in line
to become bottlenecks? Until you know that, it's hard to improve the
system.

The iostat you posted is part of the answer: One component (namely the
disk IO) seems to be 100% utilized; that is probably the bottleneck.
So we can either give you a much more powerful disk IO and disk
subsystem, or make it so the workload doesn't actually hit the disk.

But let's go through a few numbers. You say your server is connected
via a single GBit Ethernet. That means that you need to serve ONLY
100 MBytes/s of bandwidth. On a high-powered motherboard with enough
CPU power (say for grins dual Opteron, each dual-core, meaning an
investment of maybe $2000 for motherboard and chips), that should be
trivial, even with the CPU load of your web server. Have you actually
checked that you are not CPU starved?
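
A quick way to check on Linux:

  # 'id' near zero or a long run queue under 'r' suggests CPU starvation;
  # a high 'wa' points back at the disks instead
  vmstat 5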

Whenever we release a new ISO image, or even something just 100MB, the
server starts to struggle.. even with only half our usual amount of
downloaders.


Then you say that things go to hell when serving big files. Strange.
Small files should present a much larger load *** if the outgoing
total bandwidth is held constant ***, because the ratio of CPU- and
disk-consuming metadata operations and seeks is much larger. But for
both small and big files, the metadata and seek problem should be
cured by adding cache. Have you tried this: Add a dozen or two GByte
of memory to the server? That will make sure that nearly all the
files that are being constantly served (for example a few ISO images)
are in cache, and don't hit the disk array. This might be
particularly important if you are using software RAID (which can
consume huge amounts of CPU power for writes, huge amount of disk IO
for writes smaller than a stripe, and huge amount of memory bandwidth
for copying the data into a variety of IO buffers).
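
To see how much memory is currently going to the page cache on Linux:

  # the 'cached' column is what's available for caching your hot files
  free -m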

Here are two other suggestions. You say you are using RAID-5 over 16
disks. Do you actually need the capacity advantage of RAID-5? If you
are using software RAID, it might be a good idea to reconfigure your
array for RAID-10 (meaning mirroring plus striping of disks).
This might greatly lower your CPU overhead.
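
With Linux md, the rebuilt array might look roughly like this (device
names are placeholders, and this destroys the existing array, so treat
it only as a sketch):

  # 16 disks as RAID-10: mirrored pairs, striped together
  mdadm --create /dev/md0 --level=10 --raid-devices=16 /dev/sd[b-q]
  mkfs -t xfs /dev/md0    # filesystem choice is just an example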

Second, your workload should be nearly completely read-only. Meaning
that there should be no writes to the disks. Meaning that things
should run like a bat out of hell even with RAID-5, because there are
no disk updates (meaning no read-modify-write cycles for sub-stripe
updates). So maybe your problem is that you have a lot of disk writes
going? Here is a suggestion: what file system are you using? Could
it be that the file system is updating atime (last access time)
whenever a client reads a file, and the thing that is killing you is
not actually the reads, but the small writes that come from atime
updates? Try this: mount your data read-only, or get your filesystem
to disable atime updates. Now, if you are running some HSM or ILM
software, the lack of atime updates might break it, so be a little
careful.
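
On Linux that would be something like (the mount point is a placeholder):

  # stop atime writes on the download volume
  mount -o remount,noatime /data

  # or make it permanent in /etc/fstab, e.g.:
  # /dev/md0  /data  xfs  defaults,noatime  0 0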

I am guessing this is because the slower clients are unable to take
advantage of reading the files sequentially, and as such create more of
a random IO situation, and we are maxing out IOPS?


Shouldn't. Any good file system (which file system are you using?)
should be prefetching enough data into the cache to make the disk IO
sequential enough to get good efficiency. But do you have enough
cache?
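
If the filesystem's own prefetching turns out to be too timid, the
block-device readahead can be checked and raised (device name is a
placeholder; the value is in 512-byte sectors):

  blockdev --getra /dev/md0        # current readahead
  blockdev --setra 4096 /dev/md0   # about 2 MB of readahead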

I can alleviate the problem right now by just buying another RAID-5
array and PC and doing DNS round robin, but then I am maintaining two
copies of our data and it just doesn't really feel like the optimal
solution.


If you want to double your disk capacity, I would rather go for
RAID-10 instead, and stick to a single server (maybe by upgrading your
server to a real big one). The management overhead of having to
maintain two copies of the data, and the headaches when the two copies
diverge (which they are guaranteed to, unless you are really careful)
are probably worse than the hardware cost of throwing more iron at it.

I am wondering how people usually scale their storage, for both
capacity and IOPS, whilst keeping it easy to manage. I guess this
requires either some kind of fancy management software, or perhaps
using something like iSCSI with commodity hardware?


How do people scale their storage? By throwing money at it. Lots of
it. Your problem could be trivially solved by using high-end servers
(spend a few M$ on IBM AIX or HP HP-UX hardware), a few M$ on a SAN
(more Brocade gear than the police allows), a few M$ on storage
servers (I like the high-end Hitachi, although IBM's Shark is also
pretty good at feeding high-bandwidth workloads), and a lot of money
AND time on management software (EMC's ControlCenter seems to be the
400 lb gorilla in that market, for better or for worse). For good
measure, throw in a good cluster file system to make sure your data is
consistent (see above about having two copies of the data); I'm quite
partial towards GPFS, but then I'm highly biased, so take that advice
with a grain of salt. By going to Panasas, LeftHand Networks, Ibrix,
Isilon and such you can get similar performance, for somewhat less
money than Tier-1 gear. Look at the big Livermore system: a 2PB
single file system, tens of thousands of disks, thousands of hosts,
zillions of dollars. It can be done. Your problem is that you want
to have mid-range performance using low-end dollars. This can be
done, but is fraught with pitfalls.

I am imagining a horizontally scalable storage architecture to which I
can simply add/remove hardware as our client-base changes in size.


You are dreaming the same dream as most researchers in this field.

It
has one global namespace


Cluster or distributed file system gives you that.

where I place the file once and then the
storage management system works out how to distribute it across
spindles/controllers to achieve the best performance -- if a file is
being requested far more than any other, I guess it would be replicated
more.


The holy grail. Everyone wants it. Some systems are getting somewhat
close to it. Panasas and LeftHand are probably the closest commercial
approximations to it today (and I'm probably omitting many others that
can do similar things). But going for interestingly intelligent
self-managing systems is probably total overkill for your situation.
That's the kind of solution the likes of Goldman-Sachs and
Morgan-Stanley may be toying with in their data centers, and the kind
of thing that Google and Livermore are actually using in production.
Your problem is much smaller scale, and can hopefully be solved
without having to use heavyweight solutions.

--
The address in the header is invalid for obvious reasons. Please
reconstruct the address from the information below (look for _).
Ralph Becker-Szendy
#7
December 21st 06, 08:40 PM, posted to comp.arch.storage
Bakul Shah

wrote:
...
We often comfortably serve 1000-2000 concurrent clients. However, this
is only when the majority of downloads are small files (<= 1MB).
Whenever we release a new ISO image, or even something just 100MB, the
server starts to struggle.. even with only half our usual amount of
downloaders.


How did you measure that you can handle 1000 to 2000
concurrent clients? Even if they all connect at
the same time, typically a much smaller number of
requests will hit the array, as many will likely be
served from the file system cache.

...
Ok this makes perfect sense. But I know our Gigabit ethernet can push
harder, and from what I can tell the SAS array interface should
outperform that.


This means you can push about 100MBps.

During this time I can still dd(1) a cold file to /dev/null from the
array at around 100MiB/s, so I'm not entirely sure what to make of
this. That is why I was thinking it had something to do with the
slower clients.


This means likely the array B/W is not a problem.

With 1000 requests (assuming fair serving), each
request will get about 100KBps of net bandwidth
(given a Gig interface), which means about 16.7
minutes per client for a 100MB download. This seems
very doable. Also, ideally an entire ISO should
fit in your filesystem cache if you have 2GB or so
of RAM.

Before trying out *any* new solutions I'd first
look at the system and figure out where the
bottleneck is and if it can be removed. For
example, if your server is forking 1000 processes,
each running perl, that could eat up a lot of
memory/CPU resources! Or maybe you need to increase
system buffer resources or something -- you need to
"tune" the entire system. Any number of
possibilities exist, and without understanding the
real problem you may settle on the wrong answer!

In addition to iostat, run vmstat as well as ps axl
to check memory and CPU resource use. Also look at
network statistics. Your app should be sending
out full-size packets, and the network interface
should show no errors. Measure network bandwidth for
sustained traffic. It could be something as dumb
as a full-duplex/half-duplex mismatch, or the interface
running at 100Mbps instead of 1Gbps.
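
On Linux those checks look roughly like this (the interface name is
just an example):

  ethtool eth0    # negotiated speed and duplex
  netstat -i      # per-interface error and drop counters
  vmstat 5        # memory, swap and CPU
  ps axl          # per-process memory and CPU use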
 



