#11
On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier wrote:

>In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
>>My only point was that latency still matters. Superficial examination
>>of early results from the HP Superdome showed that Itanium is
>>apparently not very tolerant of increased latency, and HP engineers
>
>Oh, fully agreed. For some apps, latency is _everything_ (linked-lists,
>TP DBs). If the app hopscotches randomly thru RAM (SETI?) nothing else
>matters much. Modern systems have done wonders to deliver bandwidth.
>Dual channel DDR at high clocks. But has much been done to improve
>latency from ~130 ns? (old number)

Close enough.

>I thought the main idea behind on-CPU memory controllers was to reduce
>this to ~70 ns by reduced buffering/queuing.

And it has.

>A smart hub might be able to detect patterns like 2-4-6-8, 4-8-16-20-24
>or 5-4-3-2-1 but cannot possibly do anything with data-driven
>pseudo-randoms except add latency.

As David Wang has shrewdly observed, you lose a lot of information once you are outside the processor. All you have left is the history of memory requests.

How about a Bayesian network to try to infer the underlying pattern? Lame joke. Doesn't even warrant a smiley.

>>Itanium currently retires instructions in order. Sooner or later,
>>Intel has to do something for Itanium other than to increase the
>>cache size.
>
>Are you suggesting Out-of-Order retirement??? Intriguing possibility
>with a new arch.

Just another example of what another poster in another group would call my non-standard use of language. I had already started using the word "retirement" and stuck with it for no better reason than that I had already started using it. I'm not making any bold new proposals for computer architecture. Just at the moment, my brain is frazzled from trying to consume an entire branch of mathematics in a very short time, so I wouldn't recognize a good new idea if I saw one.
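[The hub-side stride detection Redelmeier describes is easy to sketch. This is a hypothetical illustration -- the class name and confirmation threshold are mine, not any real chipset's. It locks onto constant-stride streams like 2-4-6-8 or 5-4-3-2-1, but a data-driven pseudo-random stream never confirms a stride twice in a row, so it predicts nothing and stays out of the way:]

```python
class StrideDetector:
    """Toy model of a memory-hub stride prefetcher (illustrative only)."""

    def __init__(self):
        self.last_addr = None      # previous request seen on the bus
        self.last_stride = None    # previous inter-request delta
        self.confirmed = 0         # consecutive times the stride repeated

    def observe(self, addr):
        """Feed one memory request; return a prefetch address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride:
                self.confirmed += 1
            else:
                self.confirmed = 0     # pattern broken: start over
            self.last_stride = stride
            if self.confirmed >= 2:    # same stride seen three times
                prediction = addr + stride
        self.last_addr = addr
        return prediction
```

Feeding it 2, 4, 6, 8 yields a prediction of 10 on the fourth request, while a pointer-chasing (data-dependent) address stream never triggers a prefetch at all, which is exactly the failure mode being discussed.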
On the other hand, there is no reason why the only information to make it across a memory hub has to be memory requests, and there is no reason why the only thing a memory location knows about itself is that it corresponds to a particular address.

>Of course, SMT is just a different solution -- keep the CPU busy with
>other work during the ~300 clock read stalls. Good if there are
>parallel threads/tasks. Useless if not.

There is another way to use SMT, which is to execute speculative slices, and there are papers in the literature for simulated-Itania that show a dramatic improvement. Until we see the details of how SMT is implemented in Montecito, it won't be obvious whether SMT can actually be used that way in Montecito or not. If it can, it goes a long way toward making up for a lack of true run-time scheduling, since the speculative slice, whose only purpose in life is to trigger memory requests, is operating in the actual run-time environment, not one assumed by the compiler.

RM
#12
Robert Myers wrote:

>On Sun, 28 Dec 2003 19:26:26 GMT, "Yousuf Khan" wrote:
>>"Robert Myers" wrote in message ...
>>>The 6MB cache is an act of desperation on Intel's part. I don't
>>>_think_ their strategy is to keep increasing cache size. It's a
>>>losing strategy, anyway, unless you go to COMA. Itanium's in-order
>>>architecture is just too inflexible, and the problem is still cache
>>>misses.
>>
>>What's COMA?
>
>Cache-only memory architecture. The original Crays were effectively COMA because Seymour used for main memory what everybody else used for cache. That's why some three-letter agencies with no use for vector architectures bought the machines.
>
>>>Intel will, I gather, move the memory controller onto the die. Other
>>>than that, the strategy of the day (and for the foreseeable future)
>>>is to hide latency, not to address it directly.
>>
>>Yes, but AMD is also proposing something similar, and they've already
>>moved the memory controller onboard.
>
>Geez, Yousuf, not _everything_ is Intel vs. AMD. ;-) Sometimes a technical issue is just a technical issue.
>
>I cannot for the life of me get inside the head of whoever makes the technical calls at Intel, because Intel seems to want to do everything the hard way. Why, I do not know. I've wondered whether they're groping around for something they can patent -- obvious and previously tried solutions don't meet that criterion, leaving "the hard way" perhaps the preferred way from their standpoint.
>
>As it happens, Intel's bone-headed approach to computer architecture works well enough for the kinds of problems I am most interested in, which involve doing the same thing over and over again in ways that are stupefyingly predictable, where you just want to find a way to do it very fast. I've often wondered if the secret of the origins of the Itanium architecture isn't that the engineers who designed it didn't adequately take into account that most of the world isn't doing technical computing.
>
>That, and the fact that nothing works really well for the applications that matter the most, which is OLTP (on-line transaction processing).
>
>Itanium happens to interest me also as an intellectual sandbox in which I can come to grips with things that may be completely obvious to some people, but not to me. It does well enough for the problems that interest me, and over the long haul, I expect Intel's bulldozer approach to architecture and marketing to win. Those things together are why you think I am an Itanium bigot.
>
>RM

--
After being targeted with gigabytes of trash by the "SWEN" worm, I have concluded we must conceal our e-mail address. Our true address is the mirror image of what you see before the "@" symbol. It's a shame such steps are necessary. ...Charlie
#13
"Robert Myers" wrote in message ...

>>Of course, SMT is just a different solution -- keep the CPU busy with
>>other work during the ~300 clock read stalls. Good if there are
>>parallel threads/tasks. Useless if not.
>
>There is another way to use SMT, which is to execute speculative
>slices, and there are papers in the literature for simulated-Itania
>that show a dramatic improvement. Until we see the details of how SMT
>is implemented in Montecito, it won't be obvious whether SMT can
>actually be used that way in Montecito or not. If it can, it goes a
>long way toward making up for a lack of true run-time scheduling, since
>the speculative slice, whose only purpose in life is to trigger memory
>requests, is operating in the actual run-time environment, not one
>assumed by the compiler.

That's an interesting way of using SMT, but I suspect we won't see such a sophisticated use of SMT until at least 65nm, possibly 45nm. SMT in the form of the P4's Hyperthreading was done without really adding too many transistors. However, it looks like any other architecture that wants to implement SMT will need to add to its transistor count. I hear that the IBM Power5 will implement SMT, and that it has added 25% to the transistor count. That's probably more a reflection of the P4 architecture's IPC inefficiency than of the Power5's.

Yousuf Khan
#14
On Sun, 28 Dec 2003 22:21:43 GMT, CJT wrote:

>Robert Myers wrote:
>
>[snip]
>
>>I cannot for the life of me get inside the head of whoever makes the
>>technical calls at Intel, because Intel seems to want to do everything
>>the hard way. Why, I do not know. I've wondered whether they're
>>groping around for something they can patent -- obvious and previously
>>tried solutions don't meet that criterion, leaving "the hard way"
>>perhaps the preferred way from their standpoint.
>
>That is probably the correct explanation.

RM
#15
"CJT" wrote in message ...

>>I've wondered whether they're groping around for something they can
>>patent -- obvious and previously tried solutions don't meet that
>>criterion, leaving "the hard way" perhaps the preferred way from their
>>standpoint.

Yeesh, if that were the case at Intel, I wonder if they have teams of managers reviewing and shooting down ideas that are too radical, yet not proprietary enough? :-)

Yousuf Khan
#16
"Yousuf Khan" wrote in message ...

>SMT in the form of the P4's Hyperthreading was done without really
>adding too many transistors. However, it looks like any other
>architecture that wants to implement SMT will need to add to its
>transistor count.

IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not sure which) was only a few percent.

>I hear that the IBM Power5 will implement SMT, and that it has added
>25% to the transistor count.

I've seen that number as well and am curious what it actually refers to (given the EV8 experience, plus comments from the SMT researchers at UWash about the minimal added chip-area costs of SMT). It's possible that Px's use of instruction groups aggravates the problem, or that IBM is quoting the impact of side effects rather than just SMT per se (e.g., additional cache to accommodate the increased use by additional threads), or that IBM is referring only to the impact on the size of the processor core rather than on the overall chip area (which includes not only significant amounts of L2 cache but memory control and inter-chip routing logic plus, for P5, reportedly some kinds of on-chip offload engines for specific tasks).

- bill
#17
"Bill Todd" wrote in message ...

>IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not
>sure which) was only a few percent.

Perhaps Alpha is closest in philosophy to the P4, except from an earlier generation? That is, high frequencies but low IPC. After all, Alpha was the MHz king of processors for years before the crown was taken over by x86 processors. During Alpha's reign atop the MHz pile, its contemporaries (SPARC, MIPS, Power, PA-RISC, etc.) still seemed relatively competitive, despite not reaching the high MHz that Alpha did.

>>I hear that the IBM Power5 will implement SMT, and that it has added
>>25% to the transistor count.
>
>I've seen that number as well and am curious what it actually refers
>to [...] or that IBM is referring only to the impact on the size of
>the processor core rather than on the overall chip area [...]

I don't have that information, but I was also just working from the assumption that they were talking about a 25% increase in the size of just the inner core, not the overall die size.

Yousuf Khan
#18
"Yousuf Khan" wrote in message ...

>"Bill Todd" wrote in message ...
>>IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not
>>sure which) was only a few percent.
>
>Perhaps Alpha is closest in philosophy to the P4, except from an
>earlier generation? That is, high frequencies but low IPC.

Nope. While that characterization might have had some validity in early Alphas, by the time EV6 appeared Alpha's IPC was competitive with anyone's (and better than most) - and, unfortunately, soon thereafter Compaq lost interest in pushing Alpha clock rates up (after Capellas took over and reversed Pfeiffer's intention to market Alpha against the expected Itanic), so the *only* thing that kept Alpha ahead of the pack was its IPC (until, more recently, it fell a full process generation behind as well). EV8, by virtue of its 8-way issue and even greater number of in-flight instructions, would have had significantly better IPC than the rest of the world - leaving aside the impact of SMT on effective IPC.

[...]

>I don't have that information, but I was also just working from the
>assumption that they were talking about a 25% increase in the size of
>just the inner core, not the overall die size.

Depending on what the EV8 percentages referred to, that might be possible: my impression is that the POWERx core itself is pretty compact.

- bill
#20
On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier wrote:

>In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
>>My only point was that latency still matters. Superficial examination
>>of early results from the HP Superdome showed that Itanium is
>>apparently not very tolerant of increased latency, and HP engineers
>
>Oh, fully agreed. For some apps, latency is _everything_ (linked-lists,
>TP DBs). If the app hopscotches randomly thru RAM (SETI?) nothing else
>matters much. Modern systems have done wonders to deliver bandwidth.
>Dual channel DDR at high clocks. But has much been done to improve
>latency from ~130 ns? (old number)

Actually yes, though it wasn't anything obvious -- more that reducing latency has been a major emphasis of recent memory controllers. Intel and nVidia were the first to get it right, and they both did a bang-up job with their i875 and nForce2 chipsets respectively. Latency has dropped to ~100 ns on both chipsets (though I've seen all sorts of different latency numbers depending on just how it is being measured).

>I thought the main idea behind on-CPU memory controllers was to reduce
>this to ~70 ns by reduced buffering/queuing.

On-chip memory controllers reduce latency in a few ways, and it works. Even against the greatly improved memory controllers from nVidia and Intel (and now SiS and VIA have more or less caught up), the Athlon64 and Opteron still have noticeably lower latency. In fact, even with registered memory the Opteron has lower latency than a P4 with unbuffered memory. Unfortunately there is only so much that can be done here. When you get right down to it, DRAM has high latency, and nothing you do on the memory-controller side of things can change that. The real solution to latency is to replace DRAM with something new.

>>Itanium currently retires instructions in order. Sooner or later,
>>Intel has to do something for Itanium other than to increase the
>>cache size.
>
>Are you suggesting Out-of-Order retirement??? Intriguing possibility
>with a new arch.
I think he's merely suggesting out-of-order execution. I don't know how well this would work with the IA-64 instruction set, but I suppose it should be possible.

-------------
Tony Hill
hilla underscore 20 at yahoo dot ca
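[The latency numbers quoted above are typically measured with a pointer chase: build a random cyclic permutation so every access depends on the previous one, which defeats any stride prefetcher, then time the chain. A rough sketch of that benchmark's shape -- in Python the interpreter overhead swamps DRAM latency, so this illustrates the method, not a way to reproduce the ~100 ns figures:]

```python
import random
import time

def build_chain(n):
    """Random cyclic permutation: next_[i] gives the index visited after i.
    A single cycle through all n slots, in shuffled order, so each load
    is data-dependent on the one before it."""
    order = list(range(n))
    random.shuffle(order)
    next_ = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        next_[a] = b
    return next_

def chase_latency(next_, iters):
    """Average seconds per dependent access along the chain."""
    p = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        p = next_[p]            # each access waits on the previous one
    return (time.perf_counter() - t0) / iters
```

With a working set much larger than cache, every step of the chase is a cache miss, so on real hardware the per-access time converges on the raw memory latency -- which is why no chipset-side cleverness shows up in this number, only what the DRAM and controller path can actually deliver.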