A computer components & hardware forum. HardwareBanter


65nm news from Intel



 
 
  #221  
Old October 4th 04, 04:29 PM
Robert Redelmeier

In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
| On Mon, 4 Oct 2004, Robert Redelmeier wrote:
| | Code type matters. SMT is best for continuing work during
| | the ~300 clock memory fetch latency.
|
| What is the evidence to back up this claim?


Logic. When else can SMT really do net increased work?
If you want to test, run some pointer-chasers.
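
A minimal pointer-chaser sketch, for anyone who wants to try the test suggested above. It is Python only to show the access pattern; the names build_chain and chase are made up for illustration, and interpreter overhead will hide much of the real memory latency, so a serious measurement would want the same loop in C or assembler. The point is that each load depends on the one before it, so a core has nothing useful to do during a miss unless another thread is available.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list nxt that links 0..n-1 into one random cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    # Sattolo's algorithm: always swap with a strictly lower index, which
    # guarantees a single cycle of length n rather than several short ones.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` hops; each hop needs the previous result."""
    p = start
    for _ in range(steps):
        p = nxt[p]
    return p


if __name__ == "__main__":
    n = 1 << 16  # assumed size; grow it well past the cache to see misses
    chain = build_chain(n)
    t0 = time.perf_counter()
    chase(chain, n)
    print(f"{(time.perf_counter() - t0) / n * 1e9:.1f} ns per hop")
```

Timing one copy alone and then two copies on the two logical CPUs of one core would show whether SMT hides the stalls. How to pin the copies is OS-specific and not shown here.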

| I would however claim that functional units are almost free,

This is getting more and more true as caches grow, but only
from an areal perspective. A multiplier still sucks back a
huge amount of power and tosses it as heat.

| CMP will also help the former [bandwidth].

Nope, not without a second memory bus and all those pins.

-- Robert

  #222  
Old October 4th 04, 04:33 PM
Nick Maclaren


In article k,
Peter Boyle writes:
| On Mon, 4 Oct 2004, Robert Redelmeier wrote:
|
| Code type matters. SMT is best for continuing work during
| the ~300 clock memory fetch latency.
|
| What is the evidence to back up this claim?
|
| Not theories, but _evidence_ of bigger speed up compared to,
| for example, switch on event multi-threading, or CMP with simpler
| and smaller processors, but not sharing L1 cache.
|
| Note that I'm not claiming evidence the other way, but as far as
| I can tell the jury is out on the best organisation for concurrency
| on chip.

I should be happy to see even a theoretical analysis - I wasn't
impressed by Eggers's omission of a comparable CMP for comparison
purposes.


Regards,
Nick Maclaren.
  #223  
Old October 4th 04, 04:46 PM
Nick Maclaren


In article ,
"Peter Dickerson" writes:
|
| I think that Nick is muddled on this one. If the base implementation is
| already OoO then there will normally be many more physical registers than
| architected ones. To go two-way SMT may not involve adding any physical
| registers, but rather involve changes to renaming. "dual port" every
| execution unit doesn't make much sense to me. Access to execution units from
| either virtual processor is essentially free - they are after all virtual
| processors, not real. What is required is that every bit of *architected*
| processor state be renamed or duplicated, perhaps that's what Nick is
| getting at?

You haven't allowed for the problem of access. A simple (CMP)
duplication doesn't increase the connectivity, and can be done
more-or-less by replicating a single core; SMT does, and may need
the linkages redesigning. This might be a fairly simple task for
2-way SMT, though there have been reports that it isn't even for
that, but consider it for 8-way.

| You have to mangle any performance counters and many privileged
| registers fairly horribly, because their meanings and constraints
| change. Similarly, you have to add logic for CPU state change
| synchronisation, because some changes must affect only the current
| thread and some must affect both. And you have to handle the case
| of the two threads attempting incompatible operations simultaneously.
|
| What operations are incompatible? SMT as implemented in the Pentium 4, say,
| allows either virtual processor to do what it likes. One can transition from
| user to kernel and back while the other services interrupts or exceptions or
| whatever. The only coordination needed for proper operation is what is
| needed for two processors - of course the performance may suffer though.

ABSOLUTELY NOT!

Look at the performance counters, think of floating-point modes
(in SMT, they may need to change for each operation), think of
quiescing the other CPU (needed for single to dual thread switching),
think of interrupts (machine check needs one logic, and underflow
another). In ALL cases, on two CPUs, each can operate independently,
but SMT threads can't.

| yes lots of speculation. The difference here is that two CMP processors take
| about twice the silicon of one, while with SMT you have the option to use
| 1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
| will die because it's more effort than it's worth, but my bet is that chips
| will go both routes with SMT and CMP. Just one more little problem for the
| OS developers to deal with

I am referring to the fair comparison between a 2-way SMT and a
dual-core CMP using the same amount of silicon, power etc. THAT
is what should have been compared - but I can find no evidence
that it was (though it probably was).


Regards,
Nick Maclaren.
  #224  
Old October 4th 04, 05:21 PM
Peter Dickerson

"Nick Maclaren" wrote in message
...

| In article ,
| "Peter Dickerson" writes:
| |
| | I think that Nick is muddled on this one. If the base implementation is
| | already OoO then there will normally be many more physical registers than
| | architected ones. To go two-way SMT may not involve adding any physical
| | registers, but rather involve changes to renaming. "dual port" every
| | execution unit doesn't make much sense to me. Access to execution units from
| | either virtual processor is essentially free - they are after all virtual
| | processors, not real. What is required is that every bit of *architected*
| | processor state be renamed or duplicated, perhaps that's what Nick is
| | getting at?
|
| You haven't allowed for the problem of access. A simple (CMP)
| duplication doesn't increase the connectivity, and can be done
| more-or-less by replicating a single core; SMT does, and may need
| the linkages redesigning. This might be a fairly simple task for
| 2-way SMT, though there have been reports that it isn't even for
| that, but consider it for 8-way.

I think we must be talking at cross purposes because to me an 8-way SMT is
very little different from a 2-way. Bigger register files for architected
state and a few more bits into the renamer. I don't know what you mean by
linkages in this context. Linkages between what and what?

| | You have to mangle any performance counters and many privileged
| | registers fairly horribly, because their meanings and constraints
| | change. Similarly, you have to add logic for CPU state change
| | synchronisation, because some changes must affect only the current
| | thread and some must affect both. And you have to handle the case
| | of the two threads attempting incompatible operations simultaneously.
| |
| | What operations are incompatible? SMT as implemented in the Pentium 4, say,
| | allows either virtual processor to do what it likes. One can transition from
| | user to kernel and back while the other services interrupts or exceptions or
| | whatever. The only coordination needed for proper operation is what is
| | needed for two processors - of course the performance may suffer though.
|
| ABSOLUTELY NOT!
|
| Look at the performance counters, think of floating-point modes
| (in SMT, they may need to change for each operation), think of
| quiescing the other CPU (needed for single to dual thread switching),
| think of interrupts (machine check needs one logic, and underflow
| another). In ALL cases, on two CPUs, each can operate independently,
| but SMT threads can't.

I don't see this at all. I'm not saying these things are trivial, I'm saying
that most of it has to be done for a single threaded OoO CPU too.

| | yes lots of speculation. The difference here is that two CMP processors take
| | about twice the silicon of one, while with SMT you have the option to use
| | 1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
| | will die because it's more effort than it's worth, but my bet is that chips
| | will go both routes with SMT and CMP. Just one more little problem for the
| | OS developers to deal with
|
| I am referring to the fair comparison between a 2-way SMT and a
| dual-core CMP using the same amount of silicon, power etc. THAT
| is what should have been compared - but I can find no evidence
| that it was (though it probably was).

While I'm looking at the cost of making a single threaded OoO CPU into a
multithreaded one. That probably explains much of the disparity above. If I
had enough silicon for two OoO CPUs I'd probably take the extra hit (5%-30%,
or whatever) to add SMT to each core. If the game is how to get the max
performance (by some measure) from a given area of silicon then I'd have to
know how big it is - if it's just too small for two separate cores...

| Regards,
| Nick Maclaren.


Peter


  #225  
Old October 4th 04, 05:29 PM
Jouni Osmala

| | yes lots of speculation. The difference here is that two CMP processors take
| | about twice the silicon of one, while with SMT you have the option to use
| | 1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
| | will die because it's more effort than it's worth, but my bet is that chips
| | will go both routes with SMT and CMP. Just one more little problem for the
| | OS developers to deal with
|
| I am referring to the fair comparison between a 2-way SMT and a
| dual-core CMP using the same amount of silicon, power etc. THAT
| is what should have been compared - but I can find no evidence
| that it was (though it probably was).
|
|
| Regards,
| Nick Maclaren.


For those who don't know, here is a more detailed explanation. Most
papers on the matter seem to assume the same number of execution units,
BUT in reality a large part of the area has N^2 areal complexity, so in
REALITY, for the same area, a SINGLE processor gets only about 1.4 times
(or 1.33 times with a 6% SMT overhead) the execution resources, EXCEPT
cache, of each of TWO processors. And doubling the cache typically gives
a 10% increase overall, so that's not a disadvantage for a CMP version
that DOESN'T share caches; on the contrary, it reduces cache conflicts.
And 1.3 times the execution resources doesn't mean 1.3 times the
single-threaded performance, as MOST of the time a single thread can use
only a small portion of the resources. So a 2-core CMP would be, for
instance, 6-way per core versus 8-way for SMT, for similar area...
And having separate caches wouldn't hurt too much, especially if one
CPU could quickly access the 2nd CPU's L2 ... with its OWN set of L2
tags for it, and without updating the other CPU's L2$ LRU state and
shared victim cache .... Now you have about twice the cache bandwidth
and less cache latency, and can avoid strange cache conflicts between
threads. Besides, a shared L2 I$ helps the I$ hit rate...
Yes, there needs to be a balance between having more cores and more
powerful cores, but current papers on the matter penalize CMP because
they don't take into account any of the design trade-offs that let a CMP
machine have MORE execution resources in total, less cache
contention and lower latencies on a hit.
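
The area arithmetic above can be checked with a few lines. This is only a sketch under stated assumptions: core area is taken to grow as the square of the issue width, and the 6% SMT overhead is applied as a plain divisor, because that is the reading that reproduces the 1.33 figure. The function names are made up for illustration.

```python
import math


def width_ratio(area_ratio=2.0, smt_overhead=0.0):
    """Issue width of one big core relative to each of the small cores.

    Assumes area ~ width**2, so width ~ sqrt(area); the SMT overhead is
    then charged as a simple divisor on the result (an assumption).
    """
    return math.sqrt(area_ratio) / (1.0 + smt_overhead)


def total_width(cores, width_per_core):
    """Total issue slots across all cores on the die."""
    return cores * width_per_core


print(round(width_ratio(), 2))                   # 1.41
print(round(width_ratio(smt_overhead=0.06), 2))  # 1.33
print(total_width(2, 6), "vs", total_width(1, 8))  # 12 vs 8
```

The last line is the 6-way versus 8-way example: 8/6 is the same 1.33 ratio, and the two smaller cores have 12 issue slots between them against 8 for the single SMT core.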

Jouni Osmala
  #226  
Old October 4th 04, 05:39 PM
Nick Maclaren


In article ,
"Peter Dickerson" writes:
|
| You haven't allowed for the problem of access. A simple (CMP)
| duplication doesn't increase the connectivity, and can be done
| more-or-less by replicating a single core; SMT does, and may need
| the linkages redesigning. This might be a fairly simple task for
| 2-way SMT, though there have been reports that it isn't even for
| that, but consider it for 8-way.
|
| I think we must be talking at cross purposes because to me an 8-way SMT is
| very little different from a 2-way. Bigger register files for architected
| state and a few more bits into the renamer. I don't know what you mean by
| linkages in this context. Linkages between what and what?

Between the register file and the execution units, and between
execution units. The point is the days when 'wiring' was cheap
are no more - at least according to every source I have heard!

| Look at the performance counters, think of floating-point modes
| (in SMT, they may need to change for each operation), think of
| quiescing the other CPU (needed for single to dual thread switching),
| think of interrupts (machine check needs one logic, and underflow
| another). In ALL cases, on two CPUs, each can operate independently,
| but SMT threads can't.
|
| I don't see this at all. I'm not saying these things are trivial, I'm saying
| that most of it has to be done for a single threaded OoO CPU too.

No, they don't. Take performance counters. In an OoO CPU, you have
a single process and single core, so you accumulate the counter and,
at context switch, update the process state. With SMT, you have
multiple processes and multiple cores - where does the time taken
(or events occurring) in an execution unit get assigned to? The
Pentium 4 kludges this horribly.
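
A toy model of the attribution problem described above, not a description of any real CPU's counters. Both class names are invented for the sketch. The first class is the single-threaded case, where one counter is saved at context switch; the second shows that with SMT every event needs a thread tag, and that events in shared hardware that cannot be pinned on one thread need some policy of their own.

```python
class SwitchSavedCounter:
    """Single-threaded core: one counter, saved and cleared at context switch."""

    def __init__(self):
        self.live = 0
        self.per_process = {}

    def event(self):
        self.live += 1

    def context_switch(self, outgoing):
        # Everything counted since the last switch belongs to one process.
        self.per_process[outgoing] = self.per_process.get(outgoing, 0) + self.live
        self.live = 0


class TaggedCounter:
    """SMT core: every event must say which hardware thread caused it."""

    def __init__(self, n_threads=2):
        self.per_thread = [0] * n_threads
        self.shared = 0  # events that cannot be pinned on one thread

    def event(self, thread_id=None):
        if thread_id is None:
            self.shared += 1
        else:
            self.per_thread[thread_id] += 1
```

What to do with the `shared` bucket (split it, charge both threads, or ignore it) is exactly the kind of decision that changes the meaning of a counter.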

Consider mode switching. In an OoO CPU, a typical mode switch is
a synchronisation point, and is reset on a context switch. With
SMT, a mode must be per-thread (which was said by hardware people
to be impossible a decade ago).

Consider interrupt handling. Underflow etc. had better be handled
within its thread, because the other might be non-interruptible
(and think scalability). But you had BETTER not handle all machine
checks like that (such as ones that disable an execution unit, in
a high-RAS design), as the execution units are in common.

Consider quiescing the other CPU to switch between single and dual
thread mode, to handle a machine check or whatever. You had BETTER
ensure that both CPUs don't do it at once ....
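
The "not both at once" requirement is a small arbitration problem. A software sketch of it, with a hypothetical QuiesceArbiter class standing in for whatever the hardware would actually use: at most one thread wins the right to quiesce its sibling, and the loser backs off instead of waiting.

```python
import threading


class QuiesceArbiter:
    """Let at most one hardware thread win the right to quiesce its sibling."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_quiesce(self, thread_id):
        # Non-blocking acquire: the loser must back off rather than wait,
        # otherwise each thread could sit waiting for the other to stop.
        if self._lock.acquire(blocking=False):
            self.owner = thread_id
            return True
        return False

    def resume(self, thread_id):
        # Only the thread that won the arbitration may restart the sibling.
        if self.owner != thread_id:
            raise RuntimeError("only the owner may resume the sibling")
        self.owner = None
        self._lock.release()
```

With two separate CPUs no such arbiter is needed for this purpose, which is the contrast being drawn above.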


Regards,
Nick Maclaren.
  #227  
Old October 4th 04, 06:15 PM
Stephen Fuld


"Robert Redelmeier" wrote in message
...
| In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
| | On Mon, 4 Oct 2004, Robert Redelmeier wrote:
| | | Code type matters. SMT is best for continuing work during
| | | the ~300 clock memory fetch latency.
| |
| | What is the evidence to back up this claim?
|
| Logic. When else can SMT really do net increased work?
| If you want to test, run some pointer-chasers.
|
| | I would however claim that functional units are almost free,
|
| This is getting more and more true as caches grow, but only
| from an areal perspective. A multiplier still sucks back a
| huge amount of power and tosses it as heat.


Which is an argument for SMT over CMP. With SMT, you can "share" one
multiplier between the two threads (assuming they are not both heavy users
of multiply - which is true for lots of server type workloads), whereas a CMP
would require two multipliers with all the power and heat issues that
implies.

--
- Stephen Fuld
e-mail address disguised to prevent spam


  #228  
Old October 4th 04, 06:22 PM
Stephen Fuld


"Nick Maclaren" wrote in message
...

snip

| I am referring to the fair comparison between a 2-way SMT and a
| dual-core CMP using the same amount of silicon, power etc. THAT
| is what should have been compared - but I can find no evidence
| that it was (though it probably was).


Probably because it can't be done. I think virtually everyone here believes
that the extra silicon area for a two way SMP is much less than 100% of the
die area of the core. Thus a two way SMP will use less die area, power,
etc. than a two way CMP and the comparison that you specify can't be done.
Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
tool, along with others, including CMP, in the designer's tool box. As
someone else has said, I expect the future to be combinations of both, along
with multiple chips per PCB and multiple PCBs per system.

--
- Stephen Fuld
e-mail address disguised to prevent spam


  #229  
Old October 4th 04, 06:44 PM
Keith R. Williams

In article , first{dot}
says...
"Nick Maclaren" wrote in message
...

| | In article ,
| | Robert Redelmeier writes:
| | | In comp.sys.ibm.pc.hardware.chips Nick Maclaren wrote:
| | | Robert Redelmeier writes:
| | | | A very good point. SMT is a fairly simple thing.
| | | | Orthogonal to other efforts to improve performance.
| | |
| | | Boggle. If it were either, let alone both, it would be
| | | vastly more effective.
| | |
| | | SMT is simple in that "all" that needs be done is create
| | | duplicate state machines (register sets) to create "virtual
| | | CPUs". Add some (not too much) fairness to the hardware
| | | scheduler and thread through the retirement unit. The main
| | | execution pipeline (ROB, ports, exec units) remains unchanged.
| |
| | That is wrong, completely so.
| |
| | You DON'T just create duplicate register sets, but have to "dual
| | port" every execution unit - possibly by creating a single set
| | of double the length, and create some new scheduling to manage it.
| | You have to move some privileged registers and state from out of
| | (logically) the execution units to the register sets.
|
| I think that Nick is muddled on this one. If the base implementation is
| already OoO then there will normally be many more physical registers than
| architected ones. To go two-way SMT may not involve adding any physical
| registers, but rather involve changes to renaming. "dual port" every
| execution unit doesn't make much sense to me. Access to execution units from
| either virtual processor is essentially free - they are after all virtual
| processors, not real. What is required is that every bit of *architected*
| processor state be renamed or duplicated, perhaps that's what Nick is
| getting at?


Sure, simply tag the register names with the thread ID and let the
renaming take care of sorting out the threads' architected resources.
Pretty much everything in a modern processor has to be renamed anyway.
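
A toy sketch of that idea, assuming nothing about any real design: the rename map is keyed on (thread ID, architected register), and both threads draw from one shared pool of physical registers. RenameTable and its method names are invented for illustration, and "eax" in the usage note is just an example register name.

```python
class RenameTable:
    """Maps (thread_id, architected register) to a shared physical register."""

    def __init__(self, n_physical):
        self.free = list(range(n_physical))
        self.map = {}

    def rename_dest(self, thread_id, arch_reg):
        """Allocate a fresh physical register for a write by this thread."""
        if not self.free:
            raise RuntimeError("no free physical registers; stall")
        phys = self.free.pop(0)
        old = self.map.get((thread_id, arch_reg))
        self.map[(thread_id, arch_reg)] = phys
        return phys, old  # the caller frees `old` when the write retires

    def lookup_src(self, thread_id, arch_reg):
        """Find the physical register holding this thread's current value."""
        return self.map[(thread_id, arch_reg)]

    def release(self, phys):
        """Return a physical register to the shared pool."""
        self.free.append(phys)
```

Two threads that both write "eax" get different physical registers and never see each other's values. The sketch says nothing about the wiring between the register file and the execution units, which is the cost argued about elsewhere in the thread.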

--
Keith
  #230  
Old October 4th 04, 08:31 PM
Nick Maclaren

In article ,
Stephen Fuld wrote:

"Nick Maclaren" wrote in message
...

| | I am referring to the fair comparison between a 2-way SMT and a
| | dual-core CMP using the same amount of silicon, power etc. THAT
| | is what should have been compared - but I can find no evidence
| | that it was (though it probably was).
|
| Probably because it can't be done. I think virtually everyone here believes
| that the extra silicon area for a two way SMP is much less than 100% of the
| die area of the core. Thus a two way SMP will use less die area, power,
| etc. than a two way CMP and the comparison that you specify can't be done.
| Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
| tool, along with others, including CMP, in the designer's tool box. As
| someone else has said, I expect the future to be combinations of both, along
| with multiple chips per PCB and multiple PCBs per system.


In the above, you mean SMT, I assume.

It's been possible for at least 5 years, probably 10. Yes, the cores
of a CMP system would necessarily be simpler, but it becomes possible
as soon as the transistor count of the latest and greatest model in
the range exceeds double that of the simplest. Well, roughly, and
allowing for the difference between code and data transistors.


Regards,
Nick Maclaren.
 




Copyright ©2004-2024 HardwareBanter.
The comments are property of their posters.