Smart memory hubs being proposed

#11 - December 28th 03, 09:20 PM - Robert Myers

On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier wrote:

In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
My only point was that latency still matters. Superficial examination
of early results from the HP Superdome showed that Itanium is
apparently not very tolerant of increased latency, and HP engineers


Oh, fully agreed. For some apps, latency is _everything_
(linked lists, transaction-processing databases). If the app hopscotches
randomly through RAM (SETI?), nothing else matters much.

Modern systems have done wonders to deliver bandwidth:
dual-channel DDR at high clocks. But has much been done
to improve latency from ~130 ns? (old number)

Close enough.

I thought the main idea behind on-CPU memory controllers
was to reduce this to ~70 ns by reducing buffering/queuing.

And it has.
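
FWIW, the way numbers like ~130 ns and ~70 ns usually get measured is a
dependent pointer chase: each load's address comes out of the previous
load, so the chain serializes on latency instead of bandwidth. A rough C
sketch of the idea (nothing calibrated; it assumes POSIX clock_gettime and
plain libc rand(), and the sizes are picked arbitrarily):

/* Toy dependent-load latency probe.  Sattolo's shuffle turns the array
 * into one big random cycle, which defeats simple stride prefetchers,
 * so the time per step approximates round-trip memory latency.
 * build: cc -O2 chase.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (1u << 24)   /* 16M entries * 8 bytes = 128 MB, >> any cache */
#define STEPS (1u << 24)

int main(void)
{
    size_t *chain = malloc((size_t)N * sizeof *chain);
    if (!chain) return 1;

    for (size_t i = 0; i < N; i++)
        chain[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {     /* Sattolo: single random cycle */
        size_t j = (size_t)rand() % i;
        size_t t = chain[i]; chain[i] = chain[j]; chain[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < STEPS; i++)
        p = chain[p];                        /* serialized, dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per dependent load (ignore: %zu)\n", ns / STEPS, p);
    free(chain);
    return 0;
}

Run on a chip with an on-die controller versus one behind an external hub,
the per-load number is where the difference between ~70 ns and ~130 ns
would show up.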

A smart hub might be able to detect patterns like 2-4-6-8,
4-8-16-20-24 or 5-4-3-2-1, but it cannot possibly do anything
with data-driven pseudo-random accesses except add latency.

As David Wang has shrewdly observed, you lose a lot of information once
you are outside the processor. All you have left is the history of
memory requests. How about a Bayesian network to try to infer the
underlying pattern? Lame joke. Doesn't even warrant a smiley.
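
To make the point concrete, about all a hub-side predictor can do with
nothing but the address history is something like the following toy stride
detector (plain C; every name in it is mine and purely illustrative). It
catches 2-4-6-8 style streams after a couple of repeats and stays silent on
data-driven pseudo-random streams, which is exactly the limitation above:

/* Toy hub-side stride detector: it sees nothing but the stream of request
 * addresses, remembers the last address and the last stride, and predicts
 * the next address once two consecutive strides match.  A data-dependent
 * (pointer-chasing) stream never shows a repeated stride, so it gets no
 * prediction at all -- the hub can only add latency there. */
#include <stdint.h>
#include <stdio.h>

struct stride_detector {
    uint64_t last_addr;
    int64_t  last_stride;
};

/* Feed one observed address; returns 1 and fills *prefetch when the last
 * two strides agree, 0 otherwise. */
static int observe(struct stride_detector *d, uint64_t addr, uint64_t *prefetch)
{
    int64_t stride = (int64_t)(addr - d->last_addr);
    int predict = (stride != 0 && stride == d->last_stride);

    if (predict)
        *prefetch = addr + (uint64_t)stride;
    d->last_stride = stride;
    d->last_addr   = addr;
    return predict;
}

int main(void)
{
    struct stride_detector d = {0, 0};
    uint64_t stream[] = { 0x1000, 0x1040, 0x1080, 0x10c0,   /* regular stride  */
                          0x9f20, 0x3a80, 0x77c0 };         /* "pointer chase" */
    uint64_t pf;

    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++) {
        if (observe(&d, stream[i], &pf))
            printf("saw %#llx -> prefetch %#llx\n",
                   (unsigned long long)stream[i], (unsigned long long)pf);
        else
            printf("saw %#llx -> no prediction\n",
                   (unsigned long long)stream[i]);
    }
    return 0;
}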

Itanium currently retires instructions in order. Sooner or later,
Intel has to do something for Itanium other than to increase the cache
size.


Are you suggesting Out-of-Order retirement???
Intriguing possibility with a new arch.

Just another example of what another poster in another group would
call my non-standard use of language. I started using the word
"retirement" and stuck with it for no better reason than that I had
already started. I'm not making any bold new proposals for computer
architecture. Just at the moment, my brain is frazzled from trying to
consume an entire branch of mathematics in a very short time, so I
wouldn't recognize a good new idea if I saw one.

On the other hand, there is no reason why the only information to make
it across a memory hub has to be memory requests, and there is no
reason why the only thing a memory location knows about itself is that
it corresponds to a particular address.

Of course, SMT is just a different solution -- keep the CPU
busy with other work during the ~300 clock read stalls.
Good if there are parallel threads/tasks. Useless if not.

There is another way to use SMT, which is to execute speculative
slices, and there are papers in the literature for simulated Itania
that show a dramatic improvement. Until we see the details of how SMT
is implemented in Montecito, it won't be obvious whether it can
actually be used that way. If it can, it goes a long way toward making
up for a lack of true run-time scheduling, since the speculative slice,
whose only purpose in life is to trigger memory requests, is operating
in the actual run-time environment, not one assumed by the compiler.
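
For anyone who hasn't run into the speculative-slice papers, the software
analogue is easy to sketch: strip the loop down to just the loads that
generate future addresses and run that copy a bit ahead of the real work,
so its misses warm the cache. The pthreads toy below shares only the cache
hierarchy rather than an SMT core, and all the names in it are mine; it
illustrates the concept, not how Montecito would actually do it
(__builtin_prefetch is the GCC/Clang hint; build with cc -O2 -pthread):

/* Software analogue of a speculative slice: a helper thread re-executes
 * just the address-generating part of the loop (the pointer chase) while
 * the main thread does the real work per node, so the helper's loads warm
 * the cache ahead of the consumer.  Sketch only. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
    struct node *next;
    double payload[7];                 /* pad the node out to a cache line */
};

static struct node *build_list(size_t n)
{
    struct node *nodes = calloc(n, sizeof *nodes);
    if (!nodes) exit(1);
    for (size_t i = 0; i + 1 < n; i++)
        nodes[i].next = &nodes[i + 1]; /* (a real test would randomize order) */
    return nodes;                      /* last node's next stays NULL */
}

/* The "slice": nothing but the loads that determine future addresses. */
static void *speculative_slice(void *head)
{
    for (struct node *p = head; p; p = p->next)
        __builtin_prefetch(p->next);   /* hint only; NULL at the end is fine */
    return NULL;
}

int main(void)
{
    struct node *head = build_list(1u << 20);
    pthread_t slice;

    pthread_create(&slice, NULL, speculative_slice, head);

    /* The "main" computation: the same traversal, with real work per node. */
    double sum = 0.0;
    for (struct node *p = head; p; p = p->next)
        sum += p->payload[0];

    pthread_join(slice, NULL);
    printf("sum = %f\n", sum);
    free(head);
    return 0;
}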

RM

#12 - December 28th 03, 10:21 PM - CJT

Robert Myers wrote:

On Sun, 28 Dec 2003 19:26:26 GMT, "Yousuf Khan" wrote:


"Robert Myers" wrote in message
...

The 6MB cache is an act of desperation on Intel's part. I don't
_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.


What's COMA?


Cache-only memory architecture. The original Crays were effectively
COMA because Seymour used for main memory what everybody else used for
cache. That's why some three-letter agencies with no use for vector
architectures bought the machines.


Intel will, I gather, move the memory controller onto the die. Other
than that, the strategy of the day (and for the foreseeable future) is
to hide latency, not to address it directly.


Yes, but AMD is also proposing something similar, and they've already moved
the memory controller onboard.


Geez, Yousuf, not _everything_ is Intel vs. AMD. ;-). Sometimes a
technical issue is just a technical issue.

I cannot for the life of me get inside the head of whoever makes the
technical calls at Intel, because Intel seems to want to do everything
the hard way. Why, I do not know.


I've wondered whether they're groping around for something they can
patent -- obvious and previously tried solutions don't meet that
criterion, leaving "the hard way" perhaps the preferred way from
their standpoint.


As it happens, Intel's bone-headed approach to computer architecture
works well enough for the kinds of problems I am most interested in,
which involve doing the same thing over and over again in ways that
are stupefyingly predictable, where you just want to find a way to do
it very fast. I've often wondered if the secret of the origins of the
Itanium architecture isn't that the engineers who designed it didn't
adequately take into account that most of the world isn't doing
technical computing. That, and the fact that nothing works really
well for the application that matters most, which is OLTP
(on-line transaction processing).

Itanium happens to interest me also as an intellectual sandbox in
which I can come to grips with things that may be completely obvious
to some people, but not to me. It does well enough for the problems
that interest me, and over the long haul, I expect Intel's bulldozer
approach to architecture and marketing to win. Those things together
are why you think I am an Itanium bigot.

RM



#13 - December 28th 03, 10:45 PM - Yousuf Khan

"Robert Myers" wrote in message
...
Of course, SMT is just a different solution -- keep the CPU
busy with other work during the ~300 clock read stalls.
Good if there are parallel threads/tasks. Useless if not.

There is another way to use SMT, which is to execute speculative
slices, and there are papers in the literature for simulated Itania
that show a dramatic improvement. Until we see the details of how SMT
is implemented in Montecito, it won't be obvious whether it can
actually be used that way. If it can, it goes a long way toward making
up for a lack of true run-time scheduling, since the speculative slice,
whose only purpose in life is to trigger memory requests, is operating
in the actual run-time environment, not one assumed by the compiler.


That's an interesting way of using SMT, but I suspect we won't see such a
sophisticated use of SMT until at least 65nm, possibly 45nm.

SMT in the form of P4's Hyperthreading was done without really adding too
many transistors. However, it looks like other architectures that want
to implement SMT will need to add to the transistor count. I hear that
the IBM Power5 will implement SMT, and that it's added 25% to their
transistor count. Probably more a reflection of the IPC inefficiency
of the P4 architecture than of the Power5's.

Yousuf Khan


#14 - December 28th 03, 10:58 PM - Robert Myers

On Sun, 28 Dec 2003 22:21:43 GMT, CJT wrote:

Robert Myers wrote:

snip

I cannot for the life of me get inside the head of whoever makes the
technical calls at Intel, because Intel seems to want to do everything
the hard way. Why, I do not know.


I've wondered whether they're groping around for something they can
patent -- obvious and previously tried solutions don't meet that
criterion, leaving "the hard way" perhaps the preferred way from
their standpoint.


That is probably the correct explanation.

RM

#15 - December 29th 03, 02:00 AM - Yousuf Khan

"CJT" wrote in message
...
I've wondered whether they're groping around for something they can
patent -- obvious and previously tried solutions don't meet that
criterion, leaving "the hard way" perhaps the preferred way from
their standpoint.


Yeesh, if that were the case at Intel, I wonder if they have teams of
managers reviewing and shooting down ideas that are too radical, yet not
proprietary enough? :-)

Yousuf Khan


#16 - December 29th 03, 04:57 AM - Bill Todd


"Yousuf Khan" wrote in
message ...

....

SMT in the form of P4's Hyperthreading was done without really adding too
many transistors. However, it looks like other architectures that want
to implement SMT will need to add to the transistor count.


IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not sure
which) was only a few percent.

I hear that
the IBM Power5 will implement SMT, and that it's added 25% to their
transistor count.


I've seen that number as well and am curious what it actually refers to
(given the EV8 experience, plus comments from the SMT researchers at UWash
about the minimal added chip-area cost of SMT). It's possible that Px's
use of instruction groups aggravates the problem, or that IBM is quoting
the impact of side effects rather than of SMT per se (e.g., additional
cache to accommodate the increased demand from additional threads), or
that IBM is referring only to the impact on the size of the processor core
rather than to the overall chip area (which includes not only significant
amounts of L2 cache but memory control and inter-chip routing logic plus,
for P5, reportedly some on-chip offload engines for specific tasks).

- bill



#17 - January 9th 04, 12:55 AM - James Boswell

Robert Redelmeier wrote:
In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
My only point was that latency still matters. Superficial examination
of early results from the HP Superdome showed that Itanium is
apparently not very tolerant of increased latency, and HP engineers


Oh, fully agreed. For some apps, latency is _everything_
(linked lists, transaction-processing databases). If the app hopscotches
randomly through RAM (SETI?), nothing else matters much.

Modern systems have done wonders to deliver bandwidth:
dual-channel DDR at high clocks. But has much been done
to improve latency from ~130 ns? (old number)


The Athlon64 3400+ returns memory latency numbers in the region of 45 ns.
That's single channel, of course, but... damn.

-JB


#18 - January 9th 04, 06:46 AM - Robert Myers

On Fri, 9 Jan 2004 00:55:32 +0000 (UTC), "James Boswell" wrote:

snip

The Athlon64 3400+ returns memory latency numbers in the region of 45 ns.
That's single channel, of course, but... damn.


Yeah, that's hot. Intel's approach to latency is expletive-deleted
annoying.

RM
 



