Smart memory hubs being proposed

#1 December 27th 03, 09:56 PM

AMD and Intel are each proposing separate but similar new approaches to
memory interconnect design for the future. They are dubbing them smart
memory hubs for now, but the details are a little sketchy. The idea involves
putting some sort of intelligence right into the memory modules.

http://www.eet.com/semi/news/OEG20030508S0023

The initial efforts are aimed at increasing memory density in servers. I'm
not sure how exactly these hubs are supposed to be "smart". I also fail to
see how adding another layer of circuitry in between the memory controller
and memory itself would speed up memory accesses, since it adds another hop
into the equation. However, perhaps these are the successors to the current
SPD ROM that is implanted on every DIMM to describe its architecture to the
memory controller on initialization? Perhaps these hubs send additional
information that SPDs can't send by themselves?

Yousuf Khan

#2 December 27th 03, 11:55 PM

FB-DIMMs....Might be a lot less there than meets the eye of the article.

FB-DIMMs translate a narrow but very fast memory interconnect into ddr2 sdram
transactions, with each FB-Dimm having an asic (the "hub") doing all of the
things discrete registers and plls used to do - PLUS the memory interconnect
actually passes through the hub on one dimm to get to the next dimm/hub,
through that one to the next, and so on. It's quite extensible, which
addresses the problem of hooking a bunch of dimms to *anything* these days
while maintaining interconnect speed.

Note, however, that memory latency is clearly not addressed in a positive
manner - sticking n pass-thru elements between the nth dimm's drams and the
host chipset rarely results in quicker memory response ;-)
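
A rough sketch of the latency point, assuming each hub adds a fixed delay in each direction. The function name and the figures (50 ns DRAM access, 3 ns per hub traversal) are illustrative assumptions, not published FB-DIMM numbers:

```python
def read_latency_ns(dimm_index, dram_ns=50.0, hop_ns=3.0):
    """Round-trip read latency to the DIMM at position dimm_index.

    dimm_index = 1 is the DIMM closest to the host chipset.
    dram_ns and hop_ns are assumed values chosen only for illustration.
    """
    if dimm_index < 1:
        raise ValueError("dimm_index starts at 1")
    # The request crosses dimm_index hubs on the way out, and the
    # returned data crosses the same hubs on the way back.
    return dram_ns + 2 * dimm_index * hop_ns


for n in (1, 2, 4, 8):
    print(f"DIMM {n}: {read_latency_ns(n):.0f} ns")
```

Under these assumptions the farthest DIMM always pays the most, so a daisy chain can only add to latency; what it buys is capacity at a constant interconnect speed.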

One can surmise the era of (up to) 6MB on-chip caches is expected to reduce
typical miss ratios down to where the even-longer-than-before latency isn't a
significant hit to overall platform performance...
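
The trade-off can be sketched with the usual average-access-time formula (hit time plus miss ratio times miss penalty). All numbers below are assumptions picked for illustration, and the sketch assumes the hit time does not change when the cache grows:

```python
def average_access_ns(hit_ns, miss_ratio, miss_penalty_ns):
    """Average memory access time: hit time plus the expected cost of misses."""
    return hit_ns + miss_ratio * miss_penalty_ns


def break_even_miss_ratio(old_miss_ratio, old_penalty_ns, new_penalty_ns):
    """Miss ratio at which a longer miss penalty costs the same as before."""
    return old_miss_ratio * old_penalty_ns / new_penalty_ns


# Assumed figures: 5 ns hit time, 2% miss ratio, 130 ns penalty today,
# 160 ns penalty once hubs sit in the path.
baseline = average_access_ns(5.0, 0.02, 130.0)
slower_memory = average_access_ns(5.0, 0.02, 160.0)
bigger_cache = average_access_ns(5.0, 0.01, 160.0)

print(f"baseline:              {baseline:.2f} ns")
print(f"slower memory:         {slower_memory:.2f} ns")
print(f"slower + bigger cache: {bigger_cache:.2f} ns")
print(f"break-even miss ratio: {break_even_miss_ratio(0.02, 130.0, 160.0):.5f}")
```

In this model the extra latency is a wash only if the larger cache pushes the miss ratio below the break-even value; workloads whose miss ratio does not respond to cache size simply get slower.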

And in any case, some powerful marketing forces will be brought to bear to
discourage any thoughts of "This is another iRDRAM marketing disaster waiting
to happen"...

/daytripper (wait for it ;-)

#3 December 28th 03, 02:02 PM

On Sat, 27 Dec 2003 23:55:30 GMT, daytripper
wrote:

snip

FB-DIMMs....Might be a lot less there than meets the eye of the article.

FB-DIMMs translate a narrow but very fast memory interconnect into ddr2 sdram
transactions, with each FB-Dimm having an asic (the "hub") doing all of the
things discrete registers and plls used to do - PLUS the memory interconnect
actually passes through the hub on one dimm to get to the next dimm/hub,
through that one to the next, and so on. It's quite extensible, which
addresses the problem of hooking a bunch of dimms to *anything* these days
while maintaining interconnect speed.

Presumably solving the problems inherent in a multi-drop bus?

Note, however, that memory latency is clearly not addressed in a positive
manner - sticking n pass-thru elements between the nth dimm's drams and the
host chipset rarely results in quicker memory response ;-)

One can surmise the era of (up to) 6MB on-chip caches is expected to reduce
typical miss ratios down to where the even-longer-than-before latency isn't a
significant hit to overall platform performance...

The 6mb cache is an act of desperation on Intel's part. I don't
_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

Intel will, I gather, move the memory controller onto the die. Other
than that, the strategy of the day (and for the foreseeable future) is
to hide latency, not to address it directly.

RM

#4 December 28th 03, 03:32 PM

In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
The 6mb cache is an act of desperation on Intel's part. I don't

Agreed. yet ...

_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

Then how do you explain the _dismal_ performance of the
Celeron4 with only 128 KB L2 and poor showing of the first
P4 with 256 versus the current P4 at 512 KB? These are
all the same P7 core with the same small L1s.

I can't blame Intel for wanting to try more cache.
This is obviously a game of diminishing returns, and the
P4EE seems to be past that point. 512 KB seems optimal for current
datasets/problems/benchmarques. Cache MATTERS.

Notice also how the AMD K7 improved from 256 to 512.
The Duron, with the tiny 64 KB L2 performs amazingly well.
Decent L1s and the excellent organization of L2 (16 way,
exclusive) saves it from the Celeron4's fate.

-- Robert

#5 December 28th 03, 05:01 PM

On Sun, 28 Dec 2003 15:32:00 GMT, Robert Redelmeier
wrote:

In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
The 6mb cache is an act of desperation on Intel's part. I don't

Agreed. yet ...

_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

Then how do you explain the _dismal_ performance of the
Celeron4 with only 128 KB L2 and poor showing of the first
P4 with 256 versus the current P4 at 512 KB? These are
all the same P7 core with the same small L1s.

I can't blame Intel for wanting to try more cache.
This is obviously a game of diminishing returns, and the
P4EE seems to be past. 512 KB seems optimal for current
datasets/problems/benchmarques. Cache MATTERS.

Notice also how the AMD K7 improved from 256 to 512.
The Duron, with the tiny 64 KB L2 performs amazingly well.
Decent L1s and the excellent organization of L2 (16 way,
exclusive) saves it from the Celeron4's fate.

Well of course cache matters, and if the latency is fixed, the
increase in cache size with the speed at which you are retiring
instructions (not clock speed) has to be superlinear, no matter how
you get there. That is to say, cache size will keep increasing,
assuming that processors are able to retire instructions at increasing
speeds.

My only point was that latency still matters. Superficial examination
of early results from the HP Superdome showed that Itanium is
apparently not very tolerant of increased latency, and HP engineers
with whom I've corresponded have not disagreed; i.e., there is a
substantial payoff to be had from a better memory subsystem.

I don't think it needs to be explained to you, but I will make the
point anyway: increased cache does no good if you have no way of
triggering memory fetches far enough ahead of time to make use of the
cache. An OoO processor can just juggle more instructions, but
Itanium currently retires instructions in order. Sooner or later,
Intel has to do something for Itanium other than to increase the cache
size.

RM

#6 December 28th 03, 06:08 PM

In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
My only point was that latency still matters. Superficial examination
of early results from the HP Superdome showed that Itanium is
apparently not very tolerant of increased latency, and HP engineers

Oh, fully agreed. For some apps, latency is _everything_
(linked-lists, TP dB). If the app hopscotches randomly
thru RAM memory (SETI?) nothing else matters much.

Modern systems have done wonders to deliver bandwidth.
Dual channel DDR at high clocks. But has much been done
to improve latency from ~130 ns? (old number)

I thought the main idea behind on-CPU memory controllers
was to reduce this to ~70 ns by reduced buffering/queuing.

A smart hub might be able to detect patterns like 2-4-6-8,
4-8-16-20-24 or 5-4-3-2-1 but cannot possibly do anything
with data-driven pseudo-randoms except add latency.
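
A minimal sketch of the kind of pattern detection meant here, assuming the hub only sees a stream of addresses. The function names and the two-confirmation threshold are hypothetical; nothing in the article says the proposed hubs work this way:

```python
def detect_stride(addresses, min_confirmations=2):
    """Return the constant stride of the most recent accesses, or None.

    The stride must repeat min_confirmations times in a row before it
    is trusted (an assumed threshold).
    """
    if len(addresses) < min_confirmations + 1:
        return None
    recent = addresses[-(min_confirmations + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    stride = deltas[0]
    if stride != 0 and all(d == stride for d in deltas):
        return stride
    return None


def predict_next(addresses):
    """Predict the next address from a constant stride, or None if no pattern."""
    stride = detect_stride(addresses)
    return None if stride is None else addresses[-1] + stride


print(predict_next([2, 4, 6, 8]))      # ascending stride
print(predict_next([5, 4, 3, 2, 1]))   # descending stride
print(predict_next([17, 3, 42, 8]))    # data-driven, no stride to find
```

The last case is the one that matters for linked lists: the next address is stored in the data being fetched, so a stride detector has nothing to latch onto and the hub contributes only its own delay.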

Itanium currently retires instructions in order. Sooner or later,
Intel has to do something for Itanium other than to increase the cache
size.

Are you suggesting Out-of-Order retirement???
Intriguing possibility with a new arch.

Of course, SMT is just a different solution -- keep the CPU
busy with other work during the ~300 clock read stalls.
Good if there are parallel threads/tasks. Useless if not.
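
As a back-of-the-envelope check, the ~130 ns figure above lines up with ~300 clocks at an assumed 2.4 GHz core clock. The utilization model below is a deliberately crude assumption (each thread alternates a fixed burst of work with one full-latency stall, and threads overlap perfectly); the 100-clock work burst is invented for illustration:

```python
def stall_clocks(latency_ns, clock_ghz):
    """Core clocks spent waiting on one memory read."""
    return latency_ns * clock_ghz


def core_utilization(threads, work_clocks, stall_clocks_per_miss):
    """Fraction of time the core is busy under the crude model above."""
    busy_fraction = work_clocks / (work_clocks + stall_clocks_per_miss)
    return min(1.0, threads * busy_fraction)


print(f"stall: {stall_clocks(130.0, 2.4):.0f} clocks")
for threads in (1, 2, 4):
    print(f"{threads} thread(s): {core_utilization(threads, 100, 300):.2f}")
```

With one thread the core in this model idles three quarters of the time; extra threads fill the gap only if they exist, which is the "useless if not" caveat.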

-- Robert

#7 December 28th 03, 06:19 PM

"Robert Redelmeier" wrote in message
m...
In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
The 6mb cache is an act of desperation on Intel's part. I don't

Agreed. yet ...

_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

Then how do you explain the _dismal_ performance of the
Celeron4 with only 128 KB L2

Market segmentation: Celeron isn't *meant* to perform at levels comparable
to Pentium - else why would people shell out more for the latter?

and poor showing of the first
P4 with 256 versus the current P4 at 512 KB?

Compilers have gotten a lot better at optimizing for P4 too over the past
couple of years - the difference from the early P4s is not *just* cache
size.

These are
all the same P7 core with the same small L1s.

The above doesn't necessarily mean that P4 may not be somewhat more
sensitive to cache size than its predecessor - but it clearly doesn't
require many MB of cache to perform well, unlike Itanic.

....

Notice also how the AMD K7 improved from 256 to 512.

Doubling cache size usually helps. But doubling cache size from 256 KB to
512 KB is a hell of a lot less expensive (in terms of chip area) than
doubling cache size from 6 MB to 12 MB.

The Duron, with the tiny 64 KB L2 performs amazingly well.
Decent L1s and the excellent organization of L2 (16 way,
exclusive) saves it from the Celeron4's fate.

Er, no: having 128 KB of L1 cache plus an exclusive L2 that makes the total
cache size effectively 192 KB (vs. the older Athlon's effective cache size
of 128 KB + 256 KB = 384 KB), plus significantly better IPC, is what saves
it from being a dud like Celeron.
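
The arithmetic can be written out as follows. The helper name is hypothetical, and the inclusive branch assumes every L1 line is duplicated in L2, so only the L2 capacity counts:

```python
def effective_cache_kb(l1_kb, l2_kb, exclusive):
    """Unique cached data across L1 and L2, in KB.

    Exclusive: a line lives in L1 or L2 but not both, so capacities add.
    Inclusive (assumed fully duplicating): L2 holds a copy of all of L1.
    """
    if exclusive:
        return l1_kb + l2_kb
    if l2_kb < l1_kb:
        raise ValueError("an inclusive L2 cannot be smaller than L1")
    return l2_kb


print(effective_cache_kb(128, 64, exclusive=True))    # Duron figures from the post
print(effective_cache_kb(128, 256, exclusive=True))   # older Athlon figures from the post
print(effective_cache_kb(128, 256, exclusive=False))  # hypothetical inclusive design, same sizes
```

The first two calls reproduce the 192 KB and 384 KB totals quoted above; the third is a made-up comparison showing what the same silicon would hold if L2 duplicated L1.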

- bill

#8 December 28th 03, 07:26 PM

"Robert Myers" wrote in message
...
The 6mb cache is an act of desperation on Intel's part. I don't
_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

What's COMA?

Intel will, I gather, move the memory controller onto the die. Other
than that, the strategy of the day (and for the foreseeable future) is
to hide latency, not to address it directly.

Yes, but AMD is also proposing something similar, and they've already moved
the memory controller onboard.

Yousuf Khan

#9 December 28th 03, 08:21 PM

Robert Redelmeier wrote:

In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:

The 6mb cache is an act of desperation on Intel's part. I don't

Agreed. yet ...

_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

Then how do you explain the _dismal_ performance of the
Celeron4 with only 128 KB L2 and poor showing of the first
P4 with 256 versus the current P4 at 512 KB? These are
all the same P7 core with the same small L1s.

I can't blame Intel for wanting to try more cache.
This is obviously a game of diminishing returns, and the
P4EE seems to be past. 512 KB seems optimal for current
datasets/problems/benchmarques. Cache MATTERS.

The non-Intel crowd has known that for years. But cache is
also expensive.

Notice also how the AMD K7 improved from 256 to 512.
The Duron, with the tiny 64 KB L2 performs amazingly well.
Decent L1s and the excellent organization of L2 (16 way,
exclusive) saves it from the Celeron4's fate.

-- Robert

--
After being targeted with gigabytes of trash by the "SWEN" worm, I have
concluded we must conceal our e-mail address. Our true address is the
mirror image of what you see before the "@" symbol. It's a shame such
steps are necessary. ...Charlie

#10 December 28th 03, 09:00 PM

On Sun, 28 Dec 2003 19:26:26 GMT, "Yousuf Khan"
wrote:

"Robert Myers" wrote in message
.. .
The 6mb cache is an act of desperation on Intel's part. I don't
_think_ their strategy is to keep increasing cache size. It's a
losing strategy, anyway, unless you go to COMA. Itanium's in-order
architecture is just too inflexible, and the problem is still cache
misses.

What's COMA?

Cache-only memory architecture. The original Crays were effectively
COMA because Seymour used for main memory what everybody else used for
cache. That's why some three-letter-agencies with no use for vector
architectures bought the machines.

Intel will, I gather, move the memory controller onto the die. Other
than that, the strategy of the day (and for the foreseeable future) is
to hide latency, not to address it directly.

Yes, but AMD is also proposing something similar, and they've already moved
the memory controller onboard.

Geez, Yousuf, not _everything_ is Intel vs. AMD. ;-). Sometimes a
technical issue is just a technical issue.

I cannot for the life of me get inside the head of whoever makes the
technical calls at Intel, because Intel seems to want to do everything
the hard way. Why, I do not know.

As it happens, Intel's bone-headed approach to computer architecture
works well enough for the kinds of problems I am most interested in,
which involve doing the same thing over and over again in ways that
are stupefyingly predictable and you just want to find a way to do it
very fast. I've often wondered if the secret of the origins of the
Itanium architecture isn't that the engineers who designed it didn't
adequately take into account that most of the world isn't doing
technical computing. That, and the fact that nothing works really
well for the application that matters the most, which is OLTP
(on-line transaction processing).

Itanium happens to interest me also as an intellectual sandbox in
which I can come to grips with things that may be completely obvious
to some people, but not to me. It does well enough for the problems
that interest me, and over the long haul, I expect Intel's bulldozer
approach to architecture and marketing to win. Those things together
are why you think I am an Itanium bigot.

RM
