Itanium Montecito stuff

#1 November 16th 03, 04:50 PM

Multicore, simultaneous multi-threading, and 24MB of cache. Looks like this one
was designed with help from the Alpha team that Intel just bought out
recently from HPaq.

Yousuf Khan

http://www.theinquirer.net/?article=12686

#2 November 16th 03, 05:03 PM

Yousuf Khan wrote:

Multicore, simultaneous multi-threading, and 24MB of cache. Looks like this one
was designed with help from the Alpha team that Intel just bought out
recently from HPaq.

Yousuf Khan

http://www.theinquirer.net/?article=12686

24 Megs of high-speed SRAM ???

Think $$$!

--

- Peter Perlsø - web: http://u238.dk

"If you have been voting for politicians who promise to give you goodies
at someone else's expense, then you have no right to complain when they
take your money and give it to someone else, including themselves."

-- Thomas Sowell (1992)

#3 November 16th 03, 05:12 PM

"Peter Perlsø" wrote:
Multicore, simultaneous multi-threading, and 24MB of cache. Looks like this one
was designed with help from the Alpha team that Intel just bought out
recently from HPaq.

24 Megs of high-speed SRAM ???

Think $$$!

Yeah, I'm not even sure why they're dicking around. Just get it over and
done with, put 1GB of SRAM on it, and get rid of that DRAM already. That
would be a feature of the processor: it doesn't need any external RAM. :-)

Yousuf Khan

#4 November 16th 03, 06:44 PM

On Sun, 16 Nov 2003 16:50:50 GMT, "Yousuf Khan"
wrote:

Multicore, simultaneous multi-threading, and 24MB of cache. Looks like this one
was designed with help from the Alpha team that Intel just bought out
recently from HPaq.

Yousuf Khan

http://www.theinquirer.net/?article=12686

SMT was always aimed at Itanium. You can achieve most of the benefits
of OoO execution without actually going OoO by using SMT helper
threads. If you're supporting two cores with four threads each, the
huge cache is inevitable.

RM

#5 November 16th 03, 08:00 PM

"Robert Myers" wrote:
On Sun, 16 Nov 2003 16:50:50 GMT, "Yousuf Khan"
wrote:

Multicore, simultaneous multi-threading, and 24MB of cache. Looks like this one
was designed with help from the Alpha team that Intel just bought out
recently from HPaq.

I kind of doubt that. Those people are reportedly all working on
Tanglewood, and any Itanic SMT effort aimed at shipping in 2005 would have
had to start at least a bit before the first of them settled in at Intel.
They may have offered comments, but I suspect that whatever SMT mechanism
may be incorporated into Itanic (I'm still a bit skeptical of this report,
though it does seem to be pretty widespread) differs enough at a very basic
level from what they were working on for EV8 that their experience may not
have been directly transferable.

Yousuf Khan

http://www.theinquirer.net/?article=12686

SMT was always aimed at Itanium.

Really? My impression is that the Itanic architecture was largely
established somewhat before SMT appeared on the horizon, that most of the
coordination by the University of Washington researchers was with DEC and
Alpha, and that SMT is particularly amenable to leveraging existing
mechanisms for out-of-order execution (e.g., in Alpha) that are
conspicuously absent in Itanic.

Intel may later have investigated ways to make use of SMT in Itanic, but I
think it was definitely a retrofit.

You can achieve most of the benefits
of OoO execution without actually going OoO by using SMT helper
threads.

Maybe. But without doubt one of the things that you sacrifice is power
efficiency (not that Itanic appears to worry about this much), since without
the OoO hardware facilities you don't have a clue whether the extra work
you're doing will be useful (and even if it is useful in preloading the
caches, when the *real* code path reaches that point the instructions still
get executed a second time anyway).

Such helper threads are also a lot more expensive in use of execution units
than OoO SMT mechanisms are (again, because of the redundant or useless
execution activity noted above), so you need more EUs (and thus more core
area, which starts to limit clock rates unless you go asynchronous) than
you'd need in an OoO SMT implementation to perform as well.

If you're supporting two cores with four threads each,

Do you have a source for the suggestion that each Montecito core supports 4
threads?

the
huge cache is inevitable.

Not if you're primarily using the SMT for helper threads (not that I'm
suggesting that this is a great idea).

- bill

#6 November 16th 03, 09:02 PM

On Sun, 16 Nov 2003 15:00:21 -0500, "Bill Todd"
wrote:

"Robert Myers" wrote:
On Sun, 16 Nov 2003 16:50:50 GMT, "Yousuf Khan"
wrote:

snip

SMT was always aimed at Itanium.

Really? My impression is that the Itanic architecture was largely
established somewhat before SMT appeared on the horizon, that most of the
coordination by the University of Washington researchers was with DEC and
Alpha, and that SMT is particularly amenable to leveraging existing
mechanisms for out-of-order execution (e.g., in Alpha) that are
conspicuously absent in Itanic.

Oh, there I go again.

SMT at _Intel_ was always aimed at Itanium.

Intel may later have investigated ways to make use of SMT in Itanic, but I
think it was definitely a retrofit.

I don't think there's much doubt about that.

You can achieve most of the benefits
of OoO execution without actually going OoO by using SMT helper
threads.

Maybe. But without doubt one of the things that you sacrifice is power
efficiency (not that Itanic appears to worry about this much), since without
the OoO hardware facilities you don't have a clue whether the extra work
you're doing will be useful (and even if it is useful in preloading the
caches, when the *real* code path reaches that point the instructions still
get executed a second time anyway).

I expect helper threads to find a place even in OoO processors. The
available work on prescheduled speculative slices looks very
promising. A helper thread would also make things like DynamoRIO look
more attractive.

Such helper threads are also a lot more expensive in use of execution units
than OoO SMT mechanisms are (again, because of the redundant or useless
execution activity noted above), so you need more EUs (and thus more core
area, which starts to limit clock rates unless you go asynchronous) than
you'd need in an OoO SMT implementation to perform as well.

A paper at SC 2003 suggests that "arithmetic is free, bandwidth is
expensive." If someone else doesn't get there first, I'll post a
thread for discussion. It warrants a separate thread.

If you're supporting two cores with four threads each,

Do you have a source for the suggestion that each Montecito core supports 4
threads?

The paper I cited previously in comp.arch
:
:http://www.cs.ucsd.edu/users/jbrown/papers/sp-cmp.pdf
:
:"Speculative Precomputation on Chip Multiprocessors"
:
:which I gather is from
:
:6th Workshop on Multithreaded Execution, Architecture, and Compilation
:(MTEAC-6) Tuesday, November 19 (2002) Istanbul, Turkey.
:
:"Figure 2 indicates that across the board, SMT consistently
:provides the greatest speedup of the four configurations
:shown, even though it has the fewest overall execution
:resources and the least amount of aggregate cache capacity."
:
:with the four configurations being 4-way SMT, vs 2, 4, and 8 way CMP.

the
huge cache is inevitable.

Not if you're primarily using the SMT for helper threads (not that I'm
suggesting that this is a great idea).

Scheduling helper threads without a roomy cache is tricky. The whole
purpose is to pull stuff into cache ahead of time, and it would be
annoying to have a helper thread bump something else out of cache that
was needed sooner than what the helper thread just pulled in.

RM

#7 November 17th 03, 12:15 AM

"Robert Myers" wrote:
On Sun, 16 Nov 2003 15:00:21 -0500, "Bill Todd"
wrote:

"Robert Myers" wrote:

....

You can achieve most of the benefits
of OoO execution without actually going OoO by using SMT helper
threads.

Maybe. But without doubt one of the things that you sacrifice is power
efficiency (not that Itanic appears to worry about this much), since without
the OoO hardware facilities you don't have a clue whether the extra work
you're doing will be useful (and even if it is useful in preloading the
caches, when the *real* code path reaches that point the instructions still
get executed a second time anyway).

I expect helper threads to find a place even in OoO processors.

Possibly, but I suspect only in situations where the workload has fewer
threads than the SMT core supports: otherwise, the other core threads will
likely be far more effective servicing real threads and leaving the
individual thread IPC up to the OoO mechanisms. With Itanic, the trade-off
may be less clear (since it has more to gain on an individual thread from SP
than an OoO core does).

The
available work on prescheduled speculative slices looks very
promising. A helper thread would also make things like DynamoRIO look
more attractive.

Such helper threads are also a lot more expensive in use of execution units
than OoO SMT mechanisms are (again, because of the redundant or useless
execution activity noted above), so you need more EUs (and thus more core
area, which starts to limit clock rates unless you go asynchronous) than
you'd need in an OoO SMT implementation to perform as well.

A paper at SC 2003 suggests that "arithmetic is free, bandwidth is
expensive."

Free in what respect(s)? The specific context above is power and chip area
(and, by extension of the latter, clock rate).

If someone else doesn't get there first, I'll post a
thread for discussion. It warrants a separate thread.

If you're supporting two cores with four threads each,

Do you have a source for the suggestion that each Montecito core supports 4
threads?

The paper I cited previously in comp.arch
:
:http://www.cs.ucsd.edu/users/jbrown/papers/sp-cmp.pdf
:
:"Speculative Precomputation on Chip Multiprocessors"
:
:which I gather is from
:
:6th Workshop on Multithreaded Execution, Architecture, and Compilation
:(MTEAC-6) Tuesday, November 19 (2002) Istanbul, Turkey.
:
:"Figure 2 indicates that across the board, SMT consistently
:provides the greatest speedup of the four configurations
:shown, even though it has the fewest overall execution
:resources and the least amount of aggregate cache capacity."
:
:with the four configurations being 4-way SMT, vs 2, 4, and 8 way CMP.

That paper concentrates on SP in CMP-only environments, and uses the
4-thread SMT core only for comparison purposes. There's nothing in it to
suggest that it refers in any way specifically to Montecito.

the
huge cache is inevitable.

Not if you're primarily using the SMT for helper threads (not that I'm
suggesting that this is a great idea).

Scheduling helper threads without a roomy cache is tricky. The whole
purpose is to pull stuff into cache ahead of time, and it would be
annoying to have a helper thread bump something else out of cache that
was needed sooner than what the helper thread just pulled in.

If that were a serious problem, it would be worst in the extremely small L1
cache and significant in the modest L2 cache. The size of the L3 cache
should be completely insensitive to it by comparison, especially with the
24-way associativity that the current Itanic2 L3 cache has: whatever data
is evicted from the L3 by the helper thread is unlikely to be very
important, whereas the new data that the helper thread is bringing in will
almost certainly be needed almost immediately.

- bill
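
The cache-pollution argument in this exchange can be illustrated with a toy model. The sketch below is not from the thread or from any Itanium documentation. It assumes a single fully associative LRU cache, a main thread that alternates between a streaming access and a small hot set, and a helper thread that prefetches the stream a fixed distance ahead. All names and parameters (LRUCache, main_thread_misses, hot_lines, prefetch_distance) are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (an assumption,
    not a model of any real Itanium cache level)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def touch(self, addr):
        """Access addr. Return True on a hit; on a miss, insert the line,
        evicting the least recently used line if the cache is full."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)
            self.lines[addr] = True
        return hit


def main_thread_misses(capacity, hot_lines, stream_len, prefetch_distance):
    """Count misses seen by the main thread only.

    Each step, a hypothetical helper thread prefetches the stream line
    prefetch_distance steps ahead, then the main thread touches the
    current stream line and one line of its hot set.
    """
    cache = LRUCache(capacity)
    hot = [("hot", i) for i in range(hot_lines)]
    for addr in hot:  # warm the hot set before measuring
        cache.touch(addr)

    misses = 0
    for step in range(stream_len):
        if prefetch_distance > 0:
            cache.touch(("stream", step + prefetch_distance))  # helper thread
        if not cache.touch(("stream", step)):  # main thread, streaming data
            misses += 1
        if not cache.touch(hot[step % hot_lines]):  # main thread, hot data
            misses += 1
    return misses


roomy = main_thread_misses(capacity=1024, hot_lines=8, stream_len=200, prefetch_distance=4)
tiny = main_thread_misses(capacity=8, hot_lines=8, stream_len=200, prefetch_distance=4)
print("main-thread misses, roomy cache:", roomy)
print("main-thread misses, tiny cache:", tiny)
```

In this toy model, a cache much larger than the hot set plus the prefetch window leaves only the first few cold stream misses, while a cache no larger than the hot set thrashes. Real L1/L2/L3 behavior also depends on associativity and replacement policy, which the model ignores.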

#8 November 28th 03, 11:37 AM

Yousuf Khan wrote:
"Peter Perlsø" wrote:
Multicore, simultaneous multi-threading, and 24MB of cache. Looks like
this one was designed with help from the Alpha team that Intel just
bought out recently from HPaq.

24 Megs of high-speed SRAM ???

Think $$$!

Yeah, I'm not even sure why they're dicking around. Just get it over and
done with, put 1GB of SRAM
on it, and get rid of that DRAM already. That would be a feature of the
processor, doesn't need any external RAM. :-)

Oddly enough, IBM were going on about that...

And on a .045 process, they could probably get a gig of edram in under
200mm^2 of die area, using the 36MB edram dies they've got alongside the
POWER5 as a guide.

-JB
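
The estimate above can be checked with back-of-the-envelope scaling. The sketch below is only an illustration under loud assumptions: die area is taken to scale linearly with capacity and with the square of the feature size, and the process node of the 36MB part is not stated in the thread, so the 130 nm figure is a placeholder. The function names are hypothetical. Rather than guess the area of the 36MB die, it computes how small that die would have to be for 1GB to fit in the 200mm^2 budget.

```python
def scaled_area_mm2(base_area_mm2, base_mb, base_node_nm, target_mb, target_node_nm):
    """Idealized area estimate: linear in capacity, quadratic in feature size.

    Real memory arrays do not shrink this cleanly, so treat the result as
    a rough bound rather than a prediction.
    """
    capacity_ratio = target_mb / base_mb
    shrink = (target_node_nm / base_node_nm) ** 2
    return base_area_mm2 * capacity_ratio * shrink


def max_base_area_mm2(budget_mm2, base_mb, base_node_nm, target_mb, target_node_nm):
    """Largest base-die area for which the scaled design still fits the budget."""
    area_per_base_mm2 = scaled_area_mm2(1.0, base_mb, base_node_nm, target_mb, target_node_nm)
    return budget_mm2 / area_per_base_mm2


ASSUMED_BASE_NODE_NM = 130  # placeholder: the thread does not give this figure
TARGET_NODE_NM = 45         # ".045 process" read as 0.045 micron

limit = max_base_area_mm2(200.0, 36, ASSUMED_BASE_NODE_NM, 1024, TARGET_NODE_NM)
print(f"36MB die must be under about {limit:.1f} mm^2 for 1GB to fit in 200 mm^2")
```

Changing ASSUMED_BASE_NODE_NM shows how sensitive the conclusion is to the starting process.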

#9 November 28th 03, 04:20 PM

James Boswell wrote:

Yousuf Khan wrote:

"Peter Perlsø" wrote:

Multicore, simultaneous multi-threading, and 24MB of cache. Looks like
this one was designed with help from the Alpha team that Intel just
bought out recently from HPaq.

24 Megs of high-speed SRAM ???

Think $$$!

Yeah, I'm not even sure why they're dicking around. Just get it over and
done with, put 1GB of SRAM
on it, and get rid of that DRAM already. That would be a feature of the
processor, doesn't need any external RAM. :-)

Oddly enough, IBM were going on about that..

and on a .045 process, they could probably get a gig of edram in under
200mm^2 of die area, using the 36MB edram dies they've got alongside the
POWER5 as a guide

-JB

EDRAM

Enhanced Dynamic Random Access Memory
(E-D-ram)

Another form of DRAM that includes an SRAM cache on the chip. This
allows frequently accessed data to be obtained faster. (Also known as
CDRAM.)

Just FYI.

--

- Peter Perlsø - web: http://u238.dk

#10 November 28th 03, 07:09 PM

In article ,
says...
James Boswell wrote:

Yousuf Khan wrote:

"Peter Perlsø" wrote:

Multicore, simultaneous multi-threading, and 24MB of cache. Looks like
this one was designed with help from the Alpha team that Intel just
bought out recently from HPaq.

24 Megs of high-speed SRAM ???

Think $$$!

Yeah, I'm not even sure why they're dicking around. Just get it over and
done with, put 1GB of SRAM
on it, and get rid of that DRAM already. That would be a feature of the
processor, doesn't need any external RAM. :-)

Oddly enough, IBM were going on about that..

and on a .045 process, they could probably get a gig of edram in under
200mm^2 of die area, using the 36MB edram dies they've got alongside the
POWER5 as a guide

-JB

EDRAM

Enhanced Dynamic Random Access Memory
(E-D-ram)

Another form of DRAM that includes an SRAM cache on the chip. This
allows frequently accessed data to be obtained faster. (Also known as
CDRAM.)

.... or embedded DRAM.

Just FYI.

Indeed.

--
Keith
