65nm news from Intel

#**221** October 4th 04, 04:29 PM

In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
On Mon, 4 Oct 2004, Robert Redelmeier wrote:
Code type matters. SMT is best for continuing work during
the ~300 clock memory fetch latency.

What is the evidence to back up this claim?

Logic. When else can SMT really do net increased work?
If you want to test, run some pointer-chasers.
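
A minimal sketch of such a pointer-chaser, written in Python purely for illustration. The names `make_chain` and `chase` are invented here and do not come from the thread. It builds a random single-cycle permutation (Sattolo's algorithm) so that every load depends on the previous one and the access pattern defeats prefetching. Interpreter overhead dominates the timing, so the absolute numbers mean little. A C version would be needed for a real measurement.

```python
import random
import time


def make_chain(n, seed=0):
    """Build a random single-cycle permutation (Sattolo's algorithm).

    next_index[i] is the slot visited after slot i. Following the chain
    from any start visits all n slots before returning to the start.
    """
    rng = random.Random(seed)
    next_index = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i is what guarantees a single cycle
        next_index[i], next_index[j] = next_index[j], next_index[i]
    return next_index


def chase(next_index, steps, start=0):
    """Follow the chain; each lookup depends on the result of the last."""
    i = start
    for _ in range(steps):
        i = next_index[i]
    return i


if __name__ == "__main__":
    n = 1 << 18  # assumed size; pick something well beyond the cache
    chain = make_chain(n)
    t0 = time.perf_counter()
    chase(chain, n)
    dt = time.perf_counter() - t0
    print(f"{dt / n * 1e9:.1f} ns per dependent lookup (interpreter included)")
```

To probe the SMT claim, one would time a single copy, then two separate processes pinned to sibling logical CPUs, and compare total throughput.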

I would however claim that functional units are almost free,

This is getting more and more true as caches grow, but only
from an areal perspective. A multiplier still sucks back a
huge amount of power and tosses it as heat.

CMP will also help the former [bandwidth].

Nope, not without a second memory bus and all those pins.

-- Robert

#**222** October 4th 04, 04:33 PM

In article k,
Peter Boyle writes:
| On Mon, 4 Oct 2004, Robert Redelmeier wrote:
|
| Code type matters. SMT is best for continuing work during
| the ~300 clock memory fetch latency.
|
| What is the evidence to back up this claim?
|
| Not theories, but _evidence_ of bigger speed up compared to,
| for example, switch on event multi-threading, or CMP with simpler
| and smaller processors, but not sharing L1 cache.
|
| Note that I'm not claiming evidence the other way, but as far as
| I can tell the jury is out on the best organisation for concurrency
| on chip.

I should be happy to see even a theoretical analysis - I wasn't
impressed by Eggers's omission of a comparable CMP for comparison
purposes.

Regards,
Nick Maclaren.

#**223** October 4th 04, 04:46 PM

In article ,
"Peter Dickerson" writes:
|
| I think that Nick is muddled on this one. If the base implementation is
| already OoO then there will normally be many more physical registers than
| architected ones. To go two-way SMT may not involve adding any physical
| registers, but rather involve changes to renaming. "dual port" every
| execution unit doesn't make much sense to me. Access to execution units from
| either virtual processor is essentially free - they are after all virtual
| processors, not real. What is required is that every bit of *architected*
| processor state be renamed or duplicated, perhaps that's what Nick is
| getting at?

You haven't allowed for the problem of access. A simple (CMP)
duplication doesn't increase the connectivity, and can be done
more-or-less by replicating a single core; SMT does, and may need
the linkages redesigning. This might be a fairly simple task for
2-way SMT, though there have been reports that it isn't even for
that, but consider it for 8-way.

| You have to mangle any performance counters and many privileged
| registers fairly horribly, because their meanings and constraints
| change. Similarly, you have to add logic for CPU state change
| synchronisation, because some changes must affect only the current
| thread and some must affect both. And you have to handle the case
| of the two threads attempting incompatible operations simultaneously.
|
| What operations are incompatible? SMT as implemented in the Pentium 4, say,
| allows either virtual processor to do what it likes. One can transition from
| user to kernel and back while the other services interrupts or exceptions or
| whatever. The only coordination needed for proper operation is what is
| needed for two processors - of course the performance may suffer though.

ABSOLUTELY NOT!

Look at the performance counters, think of floating-point modes
(in SMT, they may need to change for each operation), think of
quiescing the other CPU (needed for single to dual thread switching),
think of interrupts (machine check needs one logic, and underflow
another). In ALL cases, on two CPUs, each can operate independently,
but SMT threads can't.

| Yes, lots of speculation. The difference here is that two CMP processors take
| about twice the silicon of one, while with SMT you have the option to use
| 1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
| will die because it's more effort than it's worth, but my bet is that chips
| will go both routes with SMT and CMP. Just one more little problem for the
| OS developers to deal with.

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

Regards,
Nick Maclaren.

#**224** October 4th 04, 05:21 PM

"Nick Maclaren" wrote in message
...

In article ,
"Peter Dickerson" writes:
|
| I think that Nick is muddled on this one. If the base implementation is
| already OoO then there will normally be many more physical registers than
| architected ones. To go two-way SMT may not involve adding any physical
| registers, but rather involve changes to renaming. "dual port" every
| execution unit doesn't make much sense to me. Access to execution units from
| either virtual processor is essentially free - they are after all virtual
| processors, not real. What is required is that every bit of *architected*
| processor state be renamed or duplicated, perhaps that's what Nick is
| getting at?

You haven't allowed for the problem of access. A simple (CMP)
duplication doesn't increase the connectivity, and can be done
more-or-less by replicating a single core; SMT does, and may need
the linkages redesigning. This might be a fairly simple task for
2-way SMT, though there have been reports that it isn't even for
that, but consider it for 8-way.

I think we must be talking at cross purposes because to me an 8-way SMT is
very little different from a 2-way. Bigger register files for architected
state and a few more bits into the renamer. I don't know what you mean by
linkages in this context. Linkages between what and what?

| You have to mangle any performance counters and many privileged
| registers fairly horribly, because their meanings and constraints
| change. Similarly, you have to add logic for CPU state change
| synchronisation, because some changes must affect only the current
| thread and some must affect both. And you have to handle the case
| of the two threads attempting incompatible operations simultaneously.
|
| What operations are incompatible? SMT as implemented in the Pentium 4, say,
| allows either virtual processor to do what it likes. One can transition from
| user to kernel and back while the other services interrupts or exceptions or
| whatever. The only coordination needed for proper operation is what is
| needed for two processors - of course the performance may suffer though.

ABSOLUTELY NOT!

Look at the performance counters, think of floating-point modes
(in SMT, they may need to change for each operation), think of
quiescing the other CPU (needed for single to dual thread switching),
think of interrupts (machine check needs one logic, and underflow
another). In ALL cases, on two CPUs, each can operate independently,
but SMT threads can't.

I don't see this at all. I'm not saying these things are trivial, I'm saying
that most of it has to be done for a single threaded OoO CPU too.

| Yes, lots of speculation. The difference here is that two CMP processors take
| about twice the silicon of one, while with SMT you have the option to use
| 1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
| will die because it's more effort than it's worth, but my bet is that chips
| will go both routes with SMT and CMP. Just one more little problem for the
| OS developers to deal with.

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

While I'm looking at the cost of making a single-threaded OoO CPU into a
multithreaded one. That probably explains much of the disparity above. If I
had enough silicon for two OoO CPUs I'd probably take the extra hit (5%-30%,
or whatever) to add SMT to each core. If the game is how to get the max
performance (by some measure) from a given area of silicon then I'd have to
know how big it is - if it's just too small for two separate cores...

Regards,
Nick Maclaren.

Peter

#**225** October 4th 04, 05:29 PM

| Yes, lots of speculation. The difference here is that two CMP processors take
| about twice the silicon of one, while with SMT you have the option to use
| 1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
| will die because it's more effort than it's worth, but my bet is that chips
| will go both routes with SMT and CMP. Just one more little problem for the
| OS developers to deal with.

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

Regards,
Nick Maclaren.

And for those who don't know, let me explain in more detail. Most
papers on the matter seem to assume the same number of execution units,
BUT in reality a large part of the area has N² complexity, so in
REALITY, for the same area, a SINGLE processor could get 1.4 times (or
1.33 times with 6% SMT overhead) the execution resources, EXCEPT cache,
of each of TWO processors. And doubling cache typically gives a 10%
increase overall. So that's not a disadvantage for a CMP version that
DOESN'T share caches; on the contrary, it reduces cache conflicts. And
those 1.3 times execution resources don't mean 1.3 times single-threaded
performance, as MOST of the time a single thread can use only a small
portion of the resources. So, for instance, a 2-core CMP vs SMT would be
6-way cores VS an 8-way SMT core, for similar area... And having
separate caches wouldn't hurt too much, especially if the other CPU
could quickly access the 2nd CPU's L2... with its OWN set of L2 tags,
and without updating the other CPU's L2$ LRU state and shared victim
cache.... Now you have about twice the cache bandwidth and less cache
latency, and can avoid strange cache conflicts between threads.
Besides, a shared L2 I$ helps the I$ hit rate...
Yes, there needs to be a balance between having more cores and more
powerful cores, but current papers on the matter penalize CMP because
they don't take into account any kind of design trade-offs that let a
CMP machine have MORE execution resources in total, less cache
contention, and lower latencies on a hit.
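
The arithmetic behind the 1.4, 1.33 and "6 way VS 8 way" figures can be reproduced with a short sketch. It assumes that core area (cache excluded) grows as the square of issue width, and that the 6% SMT overhead is applied as a straight divisor on width. The second assumption is an interpretation that happens to match the 1.33 figure, not something stated in the post. The function name `smt_width_ratio` is made up for this example.

```python
import math


def smt_width_ratio(cores=2, smt_overhead=0.06):
    """Width of one big core relative to each of `cores` small cores.

    Assumes equal total area, area proportional to width**2 (the N^2 term),
    and cache left out of the comparison. The SMT overhead is applied as a
    divisor on width; that is an assumed reading of the post.
    """
    no_smt = math.sqrt(cores)
    with_smt = no_smt / (1.0 + smt_overhead)
    return no_smt, with_smt


no_smt, with_smt = smt_width_ratio()
print(f"single core without SMT: {no_smt:.2f}x the width of each CMP core")
print(f"single core with SMT:    {with_smt:.2f}x")
print(f"two 6-way CMP cores (12 total) vs one {6 * with_smt:.1f}-way SMT core")
```

Under these assumptions the CMP ends up with 12 issue slots in total against roughly 8 for the SMT core, which is the trade-off the post argues published comparisons leave out.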

Jouni Osmala

#**226** October 4th 04, 05:39 PM

In article ,
"Peter Dickerson" writes:
|
| You haven't allowed for the problem of access. A simple (CMP)
| duplication doesn't increase the connectivity, and can be done
| more-or-less by replicating a single core; SMT does, and may need
| the linkages redesigning. This might be a fairly simple task for
| 2-way SMT, though there have been reports that it isn't even for
| that, but consider it for 8-way.
|
| I think we must be talking at cross purposes because to me an 8-way SMT is
| very little different from a 2-way. Bigger register files for architected
| state and a few more bits into the renamer. I don't know what you mean by
| linkages in this context. Linkages between what and what?

Between the register file and the execution units, and between
execution units. The point is the days when 'wiring' was cheap
are no more - at least according to every source I have heard!

| Look at the performance counters, think of floating-point modes
| (in SMT, they may need to change for each operation), think of
| quiescing the other CPU (needed for single to dual thread switching),
| think of interrupts (machine check needs one logic, and underflow
| another). In ALL cases, on two CPUs, each can operate independently,
| but SMT threads can't.
|
| I don't see this at all. I'm not saying these things are trivial, I'm saying
| that most of it has to be done for a single threaded OoO CPU too.

No, they don't. Take performance counters. In an OoO CPU, you have
a single process and single core, so you accumulate the counter and,
at context switch, update the process state. With SMT, you have
multiple processes and multiple cores - where does the time taken
(or events occurring) in an execution unit get assigned to? The
Pentium 4 kludges this horribly.

Consider mode switching. In an OoO CPU, a typical mode switch is
a synchronisation point, and is reset on a context switch. With
SMT, a mode must be per-thread (which was said by hardware people
to be impossible a decade ago).

Consider interrupt handling. Underflow etc. had better be handled
within its thread, because the other might be non-interruptible
(and think scalability). But you had BETTER not handle all machine
checks like that (such as ones that disable an execution unit, in
a high-RAS design), as the execution units are in common.

Consider quiescing the other CPU to switch between single and dual
thread mode, to handle a machine check or whatever. You had BETTER
ensure that both CPUs don't do it at once ....

Regards,
Nick Maclaren.

#**227** October 4th 04, 06:15 PM

"Robert Redelmeier" wrote in message
...
In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
On Mon, 4 Oct 2004, Robert Redelmeier wrote:
Code type matters. SMT is best for continuing work during
the ~300 clock memory fetch latency.

What is the evidence to back up this claim?

Logic. When else can SMT really do net increased work?
If you want to test, run some pointer-chasers.

I would however claim that functional units are almost free,

This is getting more and more true as caches grow, but only
from an areal perspective. A multiplier still sucks back a
huge amount of power and tosses it as heat.

Which is an argument for SMT over CMP. With SMP, you can "share" one
multiplier between the two threads (assuming they are not both heavy users
of multiply - which is true for lots of server type workloads), whereas a CMP
would require two multipliers with all the power and heat issues that
implies.

--
- Stephen Fuld
e-mail address disguised to prevent spam

#**228** October 4th 04, 06:22 PM

"Nick Maclaren" wrote in message
...

snip

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

Probably because it can't be done. I think virtually everyone here believes
that the extra silicon area for a two way SMP is much less than 100% of the
die area of the core. Thus a two way SMP will use less die area, power,
etc. than a two way CMP and the comparison that you specify can't be done.
Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
tool, along with others, including CMP in the designer's toolbox. As
someone else has said, I expect the future to be combinations of both, along
with multiple chips per PCB and multiple PCBs per system.

--
- Stephen Fuld
e-mail address disguised to prevent spam

#**229** October 4th 04, 06:44 PM

In article , first{dot}
says...
"Nick Maclaren" wrote in message
...

In article ,
Robert Redelmeier writes:
| In comp.sys.ibm.pc.hardware.chips Nick Maclaren wrote:
| Robert Redelmeier writes:
| | A very good point. SMT is a fairly simple thing.
| | Orthogonal to other efforts to improve performance.
|
| Boggle. If it were either, let alone both, it would be
| vastly more effective.
|
| SMT is simple in that "all" that needs be done is create
| duplicate state machines (register sets) to create "virtual
| CPUs". Add some (not too much) fairness to the hardware
| scheduler and thread through the retirement unit. The main
| execution pipeline (ROB, ports, exec units) remains unchanged.

That is wrong, completely so.

You DON'T just create duplicate register sets, but have to "dual
port" every execution unit - possibly by creating a single set
of double the length, and create some new scheduling to manage it.
You have to move some privileged registers and state from out of
(logically) the execution units to the register sets.

I think that Nick is muddled on this one. If the base implementation is
already OoO then there will normally be many more physical registers than
architected ones. To go two-way SMT may not involve adding any physical
registers, but rather involve changes to renaming. "dual port" every
execution unit doesn't make much sense to me. Access to execution units from
either virtual processor is essentially free - they are after all virtual
processors, not real. What is required is that every bit of *architected*
processor state be renamed or duplicated, perhaps that's what Nick is
getting at?

Sure, simply tag the register names with the thread ID and let the
renaming take care of sorting out the threads' architected resources.
Pretty much everything in a modern processor has to be renamed anyway.
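
As a rough illustration of that idea, here is a toy rename table keyed by (thread ID, architected register) and backed by one shared pool of physical registers. It is only a sketch: the class and method names are hypothetical, and it models none of the wiring, privileged-state or scheduling costs argued about elsewhere in the thread.

```python
class RenameTable:
    """Maps (thread_id, architected register) to a shared physical register."""

    def __init__(self, num_physical):
        self.free = list(range(num_physical))  # one free list for all threads
        self.map = {}  # (thread_id, arch_reg) -> physical register number

    def rename_dest(self, thread_id, arch_reg):
        """Allocate a fresh physical register for a write by this thread.

        Returns the new mapping and the previous one (or None); the previous
        register would be released once the writing instruction retires.
        """
        if not self.free:
            raise RuntimeError("no free physical registers: rename must stall")
        phys = self.free.pop(0)
        old = self.map.get((thread_id, arch_reg))
        self.map[(thread_id, arch_reg)] = phys
        return phys, old

    def lookup_src(self, thread_id, arch_reg):
        """Find the physical register holding this thread's current value."""
        return self.map[(thread_id, arch_reg)]

    def release(self, phys):
        """Return a physical register to the shared free list."""
        self.free.append(phys)


table = RenameTable(num_physical=8)
p0, _ = table.rename_dest(0, "r1")  # thread 0 writes r1
p1, _ = table.rename_dest(1, "r1")  # thread 1 writes its own r1
print(p0, p1)  # two different physical registers for the same name
```

The thread tag keeps the two architected `r1` registers apart while the physical pool stays shared.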

--
Keith

#**230** October 4th 04, 08:31 PM

In article ,
Stephen Fuld wrote:

"Nick Maclaren" wrote in message
...

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).

Probably because it can't be done. I think virtually everyone here believes
that the extra silicon area for a two way SMP is much less than 100% of the
die area of the core. Thus a two way SMP will use less die area, power,
etc. than a two way CMP and the comparison that you specify can't be done.
Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
tool, along with others, including CMP in the designer's toolbox. As
someone else has said, I expect the future to be combinations of both, along
with multiple chips per PCB and multiple PCBs per system.

In the above, you mean SMT, I assume.

It's been possible for at least 5 years, probably 10. Yes, the cores
of a CMP system would necessarily be simpler, but it becomes possible
as soon as the transistor count of the latest and greatest model in
the range exceeds double that of the simplest. Well, roughly, and
allowing for the difference between code and data transistors.

Regards,
Nick Maclaren.
