65nm news from Intel

#**211** October 4th 04, 02:34 AM

In comp.sys.ibm.pc.hardware.chips keith wrote:
I'm not against SMP at all, if it's free I'll take it (and
have predicted multiple core processors here for at least five
years), but to say it's somehow "free" today, is *nutz*. Even
a short few years ago I stated that two complete systems
were better than one dual. I think the line is crossing
soon to the dual-processor, but I'd rather have two systems.
...both duals soon. ;-)

I think you're a bit behind the times.

I've been running an Abit BP6 (dual OC Celerons) as my main
machine since July 1999. Current uptime 195 days. IIRC when
I built it, the premium for dual was $75. Effectively zero,
especially considering the life extension.

But two complete systems are still better for some things
(backup, MS-Windows) and always will be.

-- Robert

#**212** October 4th 04, 03:44 AM

On Mon, 04 Oct 2004 01:34:39 +0000, Robert Redelmeier wrote:

In comp.sys.ibm.pc.hardware.chips keith wrote:
I'm not against SMP at all, if it's free I'll take it (and
have predicted multiple core processors here for at least five
years), but to say it's somehow "free" today, is *nutz*. Even
a short few years ago I stated that two complete systems
were better than one dual. I think the line is crossing
soon to the dual-processor, but I'd rather have two systems.
...both duals soon. ;-)

I think you're a bit behind the times.

Well, I was talking about single-chip SMP. Even at that it was rather
obvious (I believe I argued with Fleger over this). What else to do with
infinite transistor budgets after caches? Actually *designing* a way of
using transistors is exponentially difficult. Doubling caches is more or
less linear, as is another processor.

I've been running an Abit BP6 (dual OC Celerons) as my main machine
since July 1999. Current uptime 195 days. IIRC when I built it, the
premium for dual was $75. Effectively zero, especially considering the
life extension.

When I looked (a few months ago) a decent dual AthlonMP board was around
$400, with the processors at rather a premium too. I was *considering* a
dual K7 at the time, rather than a single K8. The duals lost because of
the cost. It would have been cheaper to upgrade the second system than go
SMP.

But two complete systems are still better for some things (backup,
MS-Windows) and always will be.

...particularly when Linux is on this one. ;-)

--
Keith

#**213** October 4th 04, 08:37 AM

In article ,
keith writes:
| On Sun, 03 Oct 2004 09:24:06 -0700, Eugene Miya wrote:
| Stefan Monnier wrote:
|
| Your second CPU will be mostly idle, of course, but so is the first CPU
| anyway ;-)
|
| Yeah, but that's not bad.
| 2nd CPUs are cheap these days.
|
| You may think the second is "cheap", but I don't. The second CPU and the
| board that goes with it are certainly *not* "cheap".

What board?

The cost difference is far more marketing than production. Dual
CPU boards are sold as 'servers' and as 'performance workstations',
both at a premium. They could equally well be sold with the same
margin as the 'economy' boards.

Regards,
Nick Maclaren.

#**214** October 4th 04, 02:25 PM

In comp.sys.ibm.pc.hardware.chips keith wrote:
Well, I was talking about single-chip SMP.

Sorry, I missed that upthread.

What else to do with infinite transistor
budgets after caches?

A very good point. SMT is a fairly simple thing.
Orthogonal to other efforts to improve performance.

Actually *designing* a way of using
transistors is exponentially difficult.

True enough. You run out of orthogonalities.

When I looked (a few months ago) a decent dual AthlonMP board
was around $400, with the processors at a rather premium too.

Decent? What do you classify as decent? I see 'em around $200,
and surely you don't shy away from fixing painted jumpers?
I figure the dual premium is around $200 now.

...particularly when Linux is on this one. ;-)

Oh, I see you're still running the K6-3.
No reason to stop.

-- Robert

#**215** October 4th 04, 02:32 PM

In article ,
Robert Redelmeier writes:
|
| Well, I was talking about single-chip SMP.
|
| Sorry, I missed that upthread.
|
| What else to do with infinite transistor
| budgets after caches?
|
| A very good point. SMT is a fairly simple thing.
| Orthogonal to other efforts to improve performance.

Boggle. If it were either, let alone both, it would be vastly
more effective.

Regards,
Nick Maclaren.

#**216** October 4th 04, 03:36 PM

In comp.sys.ibm.pc.hardware.chips Nick Maclaren wrote:
Robert Redelmeier writes:
| A very good point. SMT is a fairly simple thing.
| Orthogonal to other efforts to improve performance.

Boggle. If it were either, let alone both, it would be
vastly more effective.

SMT is simple in that "all" that needs be done is create
duplicate state machines (register sets) to create "virtual
CPUs". Add some (not too much) fairness to the hardware
scheduler and thread through the retirement unit. The main
execution pipeline (ROB, ports, exec units) remains unchanged.

"Vastly more effective" is a comparative term. What do you
expect? SMT won't match SMP under most circumstances. You
don't have the ports or exec units! It'll be particularly
lame on the P7 because that throwback is short of issue ports.

Code type matters. SMT is best for continuing work during
the ~300 clock memory fetch latency. You'd rather the CPU
just stall? But most optimized code has already done
prefetching and is either bandwidth or compute limited.
SMT will help with neither. SMP will only help the latter.

-- Robert
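
The latency-hiding argument in the post above can be illustrated with a toy model. This is only a sketch under loud assumptions that are not from the thread: each thread alternates a fixed burst of compute cycles with one memory miss, the pipeline issues work from at most one thread per cycle, and there is no prefetching or bandwidth limit. The function name `utilization` and its parameters are made up for the illustration.

```python
def utilization(n_threads, compute_cycles, miss_latency, total_cycles=100_000):
    """Fraction of cycles in which the shared pipeline does useful work.

    Toy model (assumption): every thread runs `compute_cycles` cycles,
    then waits `miss_latency` cycles on a memory fetch, and repeats.
    """
    remaining = [compute_cycles] * n_threads  # compute cycles left in the current burst
    ready_at = [0] * n_threads                # first cycle at which each thread may run
    busy = 0
    last = -1                                 # thread that ran most recently
    for cycle in range(total_cycles):
        # Round-robin: try the threads in order, starting after the last one that ran.
        for k in range(1, n_threads + 1):
            t = (last + k) % n_threads
            if ready_at[t] <= cycle:
                remaining[t] -= 1
                busy += 1
                last = t
                if remaining[t] == 0:
                    # Burst finished: the thread now waits on memory.
                    ready_at[t] = cycle + 1 + miss_latency
                    remaining[t] = compute_cycles
                break
    return busy / total_cycles


if __name__ == "__main__":
    for threads in (1, 2):
        print(threads, "thread(s), 300-cycle misses:", utilization(threads, 100, 300))
        print(threads, "thread(s), no misses:      ", utilization(threads, 100, 0))
```

In this model a second hardware thread raises utilization only when there are stall cycles to fill; with no misses the pipeline is already saturated by one thread. It says nothing about bandwidth-limited code, or about the comparison with CMP that the later posts argue over.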

#**217** October 4th 04, 03:50 PM

In article ,
Robert Redelmeier writes:
| In comp.sys.ibm.pc.hardware.chips Nick Maclaren wrote:
| Robert Redelmeier writes:
| | A very good point. SMT is a fairly simple thing.
| | Orthogonal to other efforts to improve performance.
|
| Boggle. If it were either, let alone both, it would be
| vastly more effective.
|
| SMT is simple in that "all" that needs be done is create
| duplicate state machines (register sets) to create "virtual
| CPUs". Add some (not too much) fairness to the hardware
| scheduler and thread through the retirement unit. The main
| execution pipeline (ROB, ports, exec units) remains unchanged.

That is wrong, completely so.

You DON'T just create duplicate register sets, but have to "dual
port" every execution unit - possibly by creating a single set
of double the length, and create some new scheduling to manage it.
You have to move some privileged registers and state from out of
(logically) the execution units to the register sets.

You have to mangle any performance counters and many privileged
registers fairly horribly, because their meanings and constraints
change. Similarly, you have to add logic for CPU state change
synchronisation, because some changes must affect only the current
thread and some must affect both. And you have to handle the case
of the two threads attempting incompatible operations simultaneously.

Oh, of course, none of this affects the main flow of control,
but all forms of real engineering (as distinct from academic
demonstrations and marketing) are as much or more about the problem
cases as the normal ones.

| "Vastly more effective" is a comparative term. What do you
| expect? SMT won't match SMP under most circumstances. ...

My suspicion is that it wouldn't match CMP, with the same amount
of real estate, under most circumstances. But that is pure
speculation AS IS THE CLAIM OF THE CONVERSE until and unless
someone does some proper analysis.

Regards,
Nick Maclaren.

#**218** October 4th 04, 04:15 PM

On Mon, 4 Oct 2004, Robert Redelmeier wrote:

Code type matters. SMT is best for continuing work during
the ~300 clock memory fetch latency.

What is the evidence to back up this claim?

Not theories, but _evidence_ of bigger speed up compared to,
for example, switch on event multi-threading, or CMP with simpler
and smaller processors, but not sharing L1 cache.

Note that I'm not claiming evidence the other way, but as far as
I can tell the jury is out on the best organisation for concurrency
on chip.

I would however claim that functional units are almost free,
and that the best organisation will win in the long run, not
necessarily the one that best uses a finite number of functional units.

But most optimized code has already done
prefetching and is either bandwidth or compute limited.
SMT will help with neither. SMP will only help the latter.

CMP will also help the former.

Peter

-- Robert

#**219** October 4th 04, 04:17 PM

In comp.sys.ibm.pc.hardware.chips Nick Maclaren wrote:
That is wrong, completely so.

Interesting. Do you have specific specialised knowledge?
Or some reference to exactly how SMT has been implemented?

You DON'T just create duplicate register sets, but have to "dual
port" every execution unit - possibly by creating a single set
of double the length, and create some new scheduling to manage it.

This is an awful lot of work compared to simply tagging each
instruction with a thread number which indicates which register
set to operate upon. Then letting everything run through with
the extra bits catching dependencies.

You have to mangle any performance counters and many
privileged registers fairly horribly, because their meanings
and constraints change. Similarly, you have to add logic for
CPU state change synchronisation, because some changes must
affect only the current thread and some must affect both.

I wouldn't expect SMT to _always_ run multi-threaded.
The name is _Symmetrical_ Multi Threading. The moment the
execution environment is driven asymmetrical, I expect
failures. Some changes might require an IPI to restart.

And you have to handle the case of the two threads
attempting incompatible operations simultaneously.

Usually this is handled by the OS.

Oh, of course, none of this affects the main flow of control,
but all forms of real engineering (as distinct from academic
demonstrations and marketing) are as much or more about
the problem cases as the normal ones.

It's still engineering if it works 99% of the time so long
as it doesn't fail catastrophically in the other 1%.

I see SMT as a simple, cheap way to use fetch wait cycles.
It just needs to work in the common case, two+ pmode threads
(maybe multiple rings) with different pagemaps. Of course
you can probably make it break. Then you deserve what you get.

-- Robert

#**220** October 4th 04, 04:27 PM

"Nick Maclaren" wrote in message
...

In article ,
Robert Redelmeier writes:
| In comp.sys.ibm.pc.hardware.chips Nick Maclaren wrote:
| Robert Redelmeier writes:
| | A very good point. SMT is a fairly simple thing.
| | Orthogonal to other efforts to improve performance.
|
| Boggle. If it were either, let alone both, it would be
| vastly more effective.
|
| SMT is simple in that "all" that needs be done is create
| duplicate state machines (register sets) to create "virtual
| CPUs". Add some (not too much) fairness to the hardware
| scheduler and thread through the retirement unit. The main
| execution pipeline (ROB, ports, exec units) remains unchanged.

That is wrong, completely so.

You DON'T just create duplicate register sets, but have to "dual
port" every execution unit - possibly by creating a single set
of double the length, and create some new scheduling to manage it.
You have to move some privileged registers and state from out of
(logically) the execution units to the register sets.

I think that Nick is muddled on this one. If the base implementation is
already OoO then there will normally be many more physical registers than
architected ones. To go two-way SMT may not involve adding any physical
registers, but rather involve changes to renaming. "dual port" every
execution unit doesn't make much sense to me. Access to execution units from
either virtual processor is essentially free - they are after all virtual
processors, not real. What is required is that every bit of *architected*
processor state be renamed or duplicated, perhaps that's what Nick is
getting at?

You have to mangle any performance counters and many privileged
registers fairly horribly, because their meanings and constraints
change. Similarly, you have to add logic for CPU state change
synchronisation, because some changes must affect only the current
thread and some must affect both. And you have to handle the case
of the two threads attempting incompatible operations simultaneously.

What operations are incompatible? SMT as implemented in the Pentium 4, say,
allows either virtual processor to do what it likes. One can transition from
user to kernel and back while the other services interrupts or exceptions or
whatever. The only coordination needed for proper operation is what is
needed for two processors - of course the performance may suffer though.

Oh, of course, none of this affects the main flow of control,
but all forms of real engineering (as distinct from academic
demonstrations and marketing) are as much or more about the problem
cases as the normal ones.

| "Vastly more effective" is a comparative term. What do you
| expect? SMT won't match SMP under most circumstances. ...

My suspicion is that it wouldn't match CMP, with the same amount
of real estate, under most circumstances. But that is pure
speculation AS IS THE CLAIM OF THE CONVERSE until and unless
someone does some proper analysis.

Yes, lots of speculation. The difference here is that two CMP processors take
about twice the silicon of one, while with SMT you have the option to use
1.5 cores' worth of silicon. Perhaps once dual cores are cheap and easy SMT
will die because it's more effort than it's worth, but my bet is that chips
will go both routes with SMT and CMP. Just one more little problem for the
OS developers to deal with.

Regards,
Nick Maclaren.

Peter
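
The renaming point made in the post above (an out-of-order core already has more physical registers than architected ones, so two-way SMT is largely a change to renaming) can be sketched as follows. This is a hypothetical illustration, not a description of how the Pentium 4 or any real core does it: the class `RenameTable`, its sizes, and the immediate freeing of old registers are all assumptions chosen to keep the example short.

```python
class RenameTable:
    """Toy rename map shared by several hardware threads (illustrative only)."""

    def __init__(self, n_threads, n_arch_regs, n_phys_regs):
        if n_phys_regs <= n_threads * n_arch_regs:
            raise ValueError("need spare physical registers for renaming")
        self.free = list(range(n_phys_regs))
        # The map is keyed by (thread, architected register), so the same
        # architected name in two threads points at different physical registers.
        self.map = {}
        for t in range(n_threads):
            for r in range(n_arch_regs):
                self.map[(t, r)] = self.free.pop(0)

    def rename(self, thread, dst, srcs):
        """Rename one instruction; returns (physical dst, physical srcs)."""
        phys_srcs = [self.map[(thread, s)] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename would stall")
        phys_dst = self.free.pop(0)
        old = self.map[(thread, dst)]
        self.map[(thread, dst)] = phys_dst
        # Simplification (assumption): real hardware releases the old register
        # only when the instruction retires; here it is freed at once.
        self.free.append(old)
        return phys_dst, phys_srcs


if __name__ == "__main__":
    table = RenameTable(n_threads=2, n_arch_regs=8, n_phys_regs=32)
    # Both threads execute "r1 = r1 op r2" on their own architected registers.
    print(table.rename(0, dst=1, srcs=[1, 2]))
    print(table.rename(1, dst=1, srcs=[1, 2]))
```

Once every operand carries a physical register number, the execution units do not need to know which thread an instruction came from. That is the sense in which access from either virtual processor is "essentially free" in the argument above. The privileged state and performance counters raised earlier in the thread are outside this sketch.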
