#31
Another AMD supercomputer, 13,000 quad-core
On Mon, 13 Nov 2006 21:45:26 -0600, "Del Cecchi" wrote:

>"George Macdonald" wrote in message news:
>>On Mon, 13 Nov 2006 08:23:31 -0600, Del Cecchi wrote:
>>>George Macdonald wrote:
>>>>Did you not notice "high capability"? "Pick a processor" is not going
>>>>to get you that. I haven't seen Del's announcement since I don't take
>>>>comp.arch.
>>>You could check for "IBM System Cluster 1350" on IBM's web site
>>>http://www-03.ibm.com/systems/cluste...ware/1350.html if you are
>>>interested. I guess I don't understand what you mean by "high
>>>capability".
>>Not sure where I got the "high" from:-) but "capability" and "capacity"
>>seem to be used to contrast the two (extreme point) types of
>>supercomputers in articles here: http://www.hpcuserforum.com/events/
>I didn't see those terms in a quick scan, but presumably capability
>refers to "big uniprocessors" like Cray vector machines (I know they
>aren't really uniprocessors these days). I think this niche has largely
>been filled by machines like Blue Gene or other clusters. Capacity
>machines are just things like what Google or Yahoo have--a warehouse
>full of servers. So in fact the Cluster 1350 is a capability machine.
>SETI@home is a capacity machine.

From the last useful Meeting Bulletin, admittedly a while back in April
2001, the sense I get is that anything built out of COTS, tightly coupled
or not, is/was considered "capacity" when compared with Crays and others
with specialized processors. Interestingly, the guy from Ford was the one
pushing the need for "capability".

--
Rgds, George Macdonald
#32
Another AMD supercomputer, 13,000 quad-core
George Macdonald wrote:

[snip earlier capability/capacity exchange]

>From the last useful Meeting Bulletin, admittedly a while back in April
>2001, the sense I get is that anything built out of COTS, tightly coupled
>or not, is/was considered "capacity" when compared with Crays and others
>with specialized processors. Interestingly, the guy from Ford was the one
>pushing the need for "capability".

Cool. Are there any "capability" machines left in the Top500?

--
Del Cecchi
"This post is my own and doesn't necessarily represent IBM's positions,
strategies or opinions."
#33
Another AMD supercomputer, 13,000 quad-core
On Wed, 15 Nov 2006 10:01:39 -0600, Del Cecchi wrote:

[snip earlier capability/capacity exchange]

>Cool. Are there any "capability" machines left in the Top500?

Is that like the Billboard "Hot 100"... but for computers?:-) Yeah, it's
true that much progress has been made in COTS since 2001, so maybe that is
the future?

--
Rgds, George Macdonald
#34
Another AMD supercomputer, 13,000 quad-core
"George Macdonald" wrote in message ... On Wed, 15 Nov 2006 10:01:39 -0600, Del Cecchi wrote: George Macdonald wrote: On Mon, 13 Nov 2006 21:45:26 -0600, "Del Cecchi" wrote: "George Macdonald" wrote in message news On Mon, 13 Nov 2006 08:23:31 -0600, Del Cecchi wrote: George Macdonald wrote: Did you not notice "high capability"? "Pick a processor" is not going to get you that. I haven't seen Del's announcement since I don't take comp.arch. You could check for "IBM System Cluster 1350" on IBM's web site http://www-03.ibm.com/systems/cluste...ware/1350.html if you are interested. I guess I don't understand what you mean by "high capability". Not sure where I got the "high" from:-) but "capability" and "capacity" seem to be used to contrast the two (extreme point) types of supercomputers in articles he http://www.hpcuserforum.com/events/. -- Rgds, George Macdonald I didn't see those terms in a quick scan but presumably capability refers to "big uniprocessors" like cray vector machines (I know they aren't really uniprocessors these days). I think this niche has largely been filled by machines like Blue Gene or other clusters. Capacity machines are just things like what google or yahoo have--a warehouse full of servers. So infact the Cluster 1350 is a Capability machine. SETI at home is a capacity machine. From the last useful Meeting Bulletin, admittedly a while back in April 2001, the sense I get is that anything built out of COTS, tightly coupled or not, is/was considered "capacity" when compared with Crays and others with specialized processors. Interestingly, the guy from Ford was one pushing the need for "capability". Cool. Are there any "capability" machines left in the top500? Is that like the Billboard "Hot 100"... but for computers?:-) Yeah it's true that much progress has been made in COTS since 2001 so maybe that is the future? -- Rgds, George Macdonald Blue gene is a network of processors, but not exactly COTS. 240 Teraflops. Number one. That a capacity machine? |
#35
Another AMD supercomputer, 13,000 quad-core
Del Cecchi wrote:

>Blue Gene is a network of processors, but not exactly COTS. 240
>teraflops. Number one. Is that a capacity machine?

Linpack flops isn't the only measure of performance that matters. It's not
sensitive to bisection bandwidth, and low bisection bandwidth forces a
particular approach to numerical analysis.

What the whiz kids at LLNL don't seem to get is that localized
approximations will _always_ get the problem wrong for strongly nonlinear
problems, because localized differencing invariably introduces an
artificial renormalization: very good for getting nice-looking but
incorrect answers. I've discussed this extensively with the one poster to
comp.arch who seems to understand strongly nonlinear systems, and he knows
exactly what I'm saying. He won't go public because the IBM/National Labs
juggernaut represents a fair slice of the non-academic jobs that might be
open to him.

The limitations of localized differencing may not be an issue for the
class of problem that LLNL needs to do, but ultimately, you can't fool
Mother Nature. The bisection bandwidth problem shows up in the poor
performance of Blue Gene on FFTs.

My fear about Blue Gene is that it will perpetuate a kind of analysis that
works well for (say) routine structural analysis, but very poorly for the
grand problems of physics (for example, turbulence and strongly
interacting systems).

As I'm sure you will say, if you've got enough bucks, you can buy all the
bisection bandwidth you need. As it is, though, all the money right now is
going into Linpack-capable machines that will never make progress on the
interesting problems of physics. It's a grand exercise in self-deception.

Robert.
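Why FFTs stress bisection bandwidth: a distributed multidimensional FFT
needs a global transpose between the per-dimension 1-D transforms, and
that transpose moves roughly half the dataset across the machine's
bisection. A minimal back-of-the-envelope sketch; the grid size and the
1 TB/s bisection figure below are assumed placeholders, not measured
numbers for any particular machine:

# Rough lower bound on one global transpose in a distributed 3-D FFT.
# All machine numbers are illustrative assumptions, not published figures.

def transpose_time(n, bytes_per_point, bisection_bw_bytes_per_s):
    """Time to re-partition an n^3 grid across the bisection.

    A slab- or pencil-decomposed 3-D FFT repartitions the grid between
    the 1-D transforms; roughly half the data must cross the bisection.
    """
    total_bytes = n ** 3 * bytes_per_point            # complex double = 16 bytes
    return (total_bytes / 2) / bisection_bw_bytes_per_s

# Assumed example: 2048^3 grid of complex doubles, hypothetical 1 TB/s bisection.
t = transpose_time(2048, 16, 1e12)
print(f"one transpose >= {t:.2f} s")                  # ~0.07 s, and a 3-D FFT needs several

The point of the sketch is only that the transpose time is set entirely by
bisection bandwidth, not by per-node flops, so a machine can look fast on
Linpack and still crawl on spectral methods.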
#36
Another AMD supercomputer, 13,000 quad-core
"Robert Myers" wrote in message ps.com... Del Cecchi wrote: Blue gene is a network of processors, but not exactly COTS. 240 Teraflops. Number one. That a capacity machine? Linpack flops isn't the only measure of performance that matters. It's not sensitive to bisection bandwidth, and low bisection bandwidth forces a particular approach to numerical analysis. What the whiz kids at LLNL don't seem to get is that localized approximations will _always_ get the problem wrong for strongly nonlinear problems, because localized differencing invariably introduces an artificial renormalization: very good for getting nice-looking but incorrect answers. I've discussed this extensively with the one poster to comp.arch who seems to understand strongly nonlinear systems and he knows exactly what I'm saying. He won't go public because the IBM/National Labs juggernaut represents a fair slice of the non-academic jobs that might be open to him. The limitations of localized differencing may not be an issue for the class of problem that LLNL needs to do, but ultimately, you can't fool mother nature. The bisection bandwidth problem shows up in the poor performance of Blue Gene on FFT's. My fear about Blue Gene is that it will perpetuate a kind of analysis that works well for (say) routine structural analysis, but very poorly for the grand problems of physics (for example, turbulence and and strongly-interacting systems). As I'm sure you will say, if you've got enough bucks, you can buy all the bisection bandwidth you need. As it is, though, all the money right now is going into linpack-capable machines that will never make progress on the interesting problems of physics. It's a grand exercise in self-deception. Robert. Well the Cluster 1350 has a pretty good network available, if the Blue Gene one isn't good enough. And Blue Gene was really designed for a few particular problems, not just Linpack. But the range of problems it is applicable to seems to be reasonably wide. And are the "interesting problems in Physics" something that folks are willing to spend reasonable amounts of money on, like the money spent on accelerators and nutrino detectors etc? And do they agree as to the kind of computer needed? Do you like the new Opteron/Cell Hybrid better? Throwing rocks is easy. How about specific suggestions? del |
#37
Another AMD supercomputer, 13,000 quad-core
Del Cecchi wrote:

>Do you like the new Opteron/Cell hybrid better? Throwing rocks is easy.
>How about specific suggestions?

I had really hoped to get out of the rock-throwing business. My criticism
really isn't of IBM, which is apparently only giving the most important
customer what it wants. The most important customer lost interest in
science a long time ago, so maybe it doesn't matter that the machines it
buys aren't good science machines.

I'm sure that a good science machine can be built within the parameters of
the Cluster 1350, and asking how you might go about that would be an
interesting exercise. Sure, the Opteron/coprocessor hybrid sounds good.
All that's left to engineer is the network. Were it up to me, I'd optimize
it to do FFT and matrix transpose. If you can do those two operations
efficiently, you can do an awful lot of very interesting physics.

The money just isn't there for basic science right now. It isn't IBM's job
to underwrite science or to try to get the government to buy machines that
it apparently doesn't want.

The bisection bandwidth of Blue Gene is millibytes per flop. That's
apparently not a problem for some customers, but there is a big slice of
important physics that you can't do correctly or efficiently with a
machine like that.

Robert.
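To see how a "millibytes per flop" figure can arise, here is a hedged
sketch for a 3-D torus. Every number in it (node counts, per-link
bandwidth, per-node flop rate) is an assumption chosen only to illustrate
the order of magnitude, not an official Blue Gene specification:

# Bisection bytes-per-flop for an X x Y x Z torus, with assumed parameters.

def bisection_bytes_per_flop(x, y, z, link_bw, node_flops):
    """Cut the torus across its largest dimension.

    Cutting perpendicular to the X axis severs 2 * y * z links (the
    factor 2 counts the wrap-around links that also cross the cut).
    """
    links_cut = 2 * y * z
    bisection_bw = links_cut * link_bw        # bytes/s crossing the cut
    total_flops = x * y * z * node_flops      # flop/s of the whole machine
    return bisection_bw / total_flops

# Assumed example: 64 x 32 x 32 nodes, 150 MB/s per link, 5.6 Gflop/s per node.
ratio = bisection_bytes_per_flop(64, 32, 32, 150e6, 5.6e9)
print(f"{ratio * 1000:.2f} millibytes per flop")   # ~1 millibyte/flop with these numbers

With parameters anywhere in that neighborhood, the ratio lands around a
millibyte per flop, which is the scale Robert is pointing at.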
#38
Another AMD supercomputer, 13,000 quad-core
Robert Myers wrote:

>The bisection bandwidth of Blue Gene is millibytes per flop. That's
>apparently not a problem for some customers, but there is a big slice of
>important physics that you can't do correctly or efficiently with a
>machine like that.

Is bisection bandwidth really a valid metric for very large clusters? It
seems to me that it can be made arbitrarily small by configuring a large
enough group of processors, since each processor has a finite number of
links. For example, a 2D mesh with nearest-neighbor connectivity has a
bisection bandwidth that grows as the square root of the number of
processors, but the flops grow as the number of processors, so the
bandwidth per flop decreases with the square root of the number of
processors. I can't think of why this wouldn't apply in general, but I
don't claim that it is true. It just seems so to me (although the rate of
decrease wouldn't necessarily be square root).

Apparently no one with money is interested in solving these special
problems for which clusters are not good enough. See SSI and Steve Chen,
history of.

--
Del Cecchi
"This post is my own and doesn't necessarily represent IBM's positions,
strategies or opinions."
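Del's scaling argument is easy to check numerically. A minimal sketch of
the trend for a square nearest-neighbor mesh; the link bandwidth and
per-node flop rate are arbitrary placeholders, since only the ratio's
dependence on processor count matters:

# Bisection bandwidth per flop for a nearest-neighbor 2-D mesh.
# Absolute numbers are placeholders; the point is the 1/sqrt(P) trend.

import math

LINK_BW = 1.0      # bandwidth of one mesh link (arbitrary units)
NODE_FLOPS = 1.0   # flop rate of one node (arbitrary units)

def mesh_2d_bw_per_flop(p):
    """Square p-node mesh: cutting it in half severs sqrt(p) links."""
    side = int(math.isqrt(p))
    bisection_bw = side * LINK_BW
    total_flops = p * NODE_FLOPS
    return bisection_bw / total_flops

for p in (64, 1024, 16384, 262144):
    print(f"P = {p:7d}  bandwidth/flop ~ {mesh_2d_bw_per_flop(p):.5f}")

# Each 16x increase in P cuts bandwidth-per-flop by 4x, i.e. it falls as
# 1/sqrt(P) -- exactly the diminishing return Del describes.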
#39
Another AMD supercomputer, 13,000 quad-core
Del Cecchi wrote:

>Is bisection bandwidth really a valid metric for very large clusters?

Yes, if you want to do FFTs, or, indeed, any kind of non-local
differencing.

>It seems to me that it can be made arbitrarily small by configuring a
>large enough group of processors, since each processor has a finite
>number of links. For example, a 2D mesh with nearest-neighbor
>connectivity has a bisection bandwidth that grows as the square root of
>the number of processors, but the flops grow as the number of processors,
>so the bandwidth per flop decreases with the square root of the number of
>processors.

That's the problem with the architecture and why I howled so loudly when
it came out. Naturally, I was ridiculed by people whose entire knowledge
of computer architecture is nearest-neighbor clusters. Someone in New
Mexico (LANL or Sandia, I don't want to dredge up the presentation again)
understands the numbers as well as I do.

The bisection bandwidth is a problem for a place like NCAR, which uses
pseudospectral techniques, as do most global atmospheric simulations. The
projected efficiency of Red Storm for FFTs was 25%. The efficiency of
Japan's Earth Simulator is at least several times that for FFTs. No big
deal. It was designed for geophysical simulations; Blue Gene at Livermore
was bought to produce the plots the Lab needed to justify its own
existence (and not to do science).

As you have correctly inferred, the more processors you hang off the
nearest-neighbor network, the worse the situation becomes.

>I can't think of why this wouldn't apply in general, but I don't claim
>that it is true. It just seems so to me (although the rate of decrease
>wouldn't necessarily be square root).

Unless you increase the aggregate bandwidth, you reach a point of
diminishing returns. The special nature of Linpack has allowed
unimaginative bureaucrats to make a career out of buying and touting very
limited machines that are the very opposite of being scalable.
"Scalability" does not mean more processors or real estate. It means the
ability to use the millionth processor as effectively as you use the 65th.
Genuine scalability is hard, which is why no one is really bothering with
it.

>Apparently no one with money is interested in solving these special
>problems for which clusters are not good enough. See SSI and Steve Chen,
>history of.

The problems aren't as special as you think. In fact, the glaring problem
that I've pointed out with machines that rely on local differencing isn't
agenda- or marketing-driven, it's an unavoidable mathematical fact. As
things stand now, we will have ever more transistors chuffing away on
generating ever-less reliable results.

The problem is this: if you use a sufficiently low-order differencing
scheme, you can do most of the problems of mathematical physics on a box
like Blue Gene. Low-order schemes are easy to code, undemanding with
regard to non-local bandwidth, and usually much more stable than very
high-order schemes. If you want to figure out how to place an
air-conditioner, they're just fine. If you're trying to do physics, the
plots you produce will be plausible and beautiful, but very often wrong.

There is an out that, in fairness, I should mention. If you have
processors to burn, you can always overresolve the problem to the point
where the renormalization problem I've mentioned, while still there,
becomes unimportant. Early results by the biggest ego in the field at the
time suggested that it takes about ten times the resolution to do fluid
mechanics with local differencing as accurately as you can do it with a
pseudospectral scheme. In 3-D, that's a thousand times more processors.
For a fair comparison, the number of processors in the Livermore box would
be divided by 1000 to get equivalent performance to a box that could do a
decent FFT.

Should be posting to comp.arch so people there can switch from being
experts on computer architecture to being experts on numerical analysis
and mathematical physics.

Robert.
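The arithmetic behind the "thousand times more processors" figure is worth
spelling out: if a local-differencing scheme needs about k times the
resolution per spatial dimension to match a pseudospectral result (Robert
cites k of roughly 10), the grid-point count, and hence the node count at
fixed points per node, grows as k to the power of the dimensionality. A
minimal sketch of that scaling; the time-step factor is my own hedged
addition, assuming an explicit scheme whose step size shrinks with the
grid spacing:

# Cost of matching pseudospectral accuracy with local differencing,
# assuming k times the resolution is needed per spatial dimension and
# node count scales with the number of grid points.

def extra_node_factor(k, dims=3, refine_time_step=True):
    """Return (node factor, total work factor).

    The node factor is k**dims. If an explicit scheme's time step must
    shrink with the grid spacing (a common stability constraint), total
    work grows by roughly one more factor of k.
    """
    nodes = k ** dims
    work = nodes * (k if refine_time_step else 1)
    return nodes, work

nodes, work = extra_node_factor(10, dims=3)
print(nodes)   # 1000 -- the "thousand times more processors"
print(work)    # ~10000x the work if the time step must shrink too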
#40
Another AMD supercomputer, 13,000 quad-core
"Robert Myers" wrote in message oups.com... Del Cecchi wrote: Is BiSection bandwidth really a valid metric for very large clusters? Yes, if you want to do FFT's, or, indeed, any kind of non-local differencing. It seems to me that it can be made arbitrarily small by configuring a large enough group of processors, since each processor has a finite number of links. For example a 2D mesh with nearest neighbor connectivity has a bisection bandwidth that grows as the square root of the number of processors. But the flops grow as the number of processors. So the bandwidth per flop decreases with the square root of the number of processors. That's the problem with the architecture and why I howled so loudly when it came out. Naturally, I was ridiculed by people whose entire knowledge of computer architecture is nearest neighbor clusters. Someone in New Mexico (LANL or Sandia, I don't want to dredge up the presentation again) understands the numbers as well as I do. The bisection bandwidth is a problem for a place like NCAR, which uses pseudospectral techniques, as do most global atmospheric simulations. The projected efficiency of Red Storm for FFT's was 25%. The efficiency of Japan's Earth Simulator is at least several times that for FFT's. No big deal. It was designed for Geophysical simulations, Blue Gene at Livermore was bought to produce the plots the Lab needed to justify its own existence (and not to do science). As you have correctly inferred, the more processors you hang off the nearest-neighbor network, the worse the situation becomes. I can't think of why this wouldn't apply in general but don't claim that it is true. It just seems so to me (although the rate of decrease wouldn't necessarily be square root) Unless you increase the aggregate bandwidth, you reach a point of diminishing returns. The special nature of Linpack has allowed unimaginative bureacrats to make a career out of buying and touting very limited machines that are the very opposite of being scalable. "Scalability" does not mean more processors or real estate. It means the ability to use the millionth processor as effectively as you use the 65th. Genuine scalability is hard, which is why no one is really bothering with it. Apparently no one with money is interested in solving these special problems for which clusters are not good enough. See SSI and steve Chen, history of. The problems aren't as special as you think. In fact, the glaring problem that I've pointed out with machines that rely on local differencing isn't agenda or marketing driven, it's an unavoidable mathematical fact. As things stand now, we will have ever more transistors chuffing away on generating ever-less reliable results. The problem is this: if you use a sufficiently low-order differencing scheme, you can do most of the problems of mathematical physics on a box like Blue Gene. Low order schemes are easy to code, undemanding with regard to non-local bandwidth, and usually much more stable than very high-order schemes. If you want to figure out how to place an air-conditioner, they're just fine. If you're trying to do physics, the plots you produce will be plausible and beautiful, but very often wrong. There is an out that, in fairness, I should mention. If you have processors to burn, you can always overresolve the problem to the point where the renormalization problem I've mentioned, while still there, becomes unimportant. 
Early results by the biggest ego in the field at the time suggested that it takes about ten times the resolution to do fluid mechanics with local differencing as accurately as you can do it with a pseudospectral scheme. In 3-D, that's a thousand times more processors. For fair comparison, the number of processors in Livermore box would be divided by 1000 to get equivalent performance to a box that could do a decent FFT. Should be posting to comp.arch so people there can switch from being experts on computer architecture to being experts on numerical analysis and mathematical physics. Robert. If I recall red storm correctly, it was a hypercube so had same problem as blue gene. |