65nm news from Intel

#**251** October 5th 04, 11:42 AM

"Nick Maclaren" wrote in message
...

In article ,
"Peter Dickerson" writes:
|
| You haven't allowed for the problem of access. A simple (CMP)
| duplication doesn't increase the connectivity, and can be done
| more-or-less by replicating a single core; SMT does, and may need
| the linkages redesigning. This might be a fairly simple task for
| 2-way SMT, though there have been reports that it isn't even for
| that, but consider it for 8-way.
|
| I think we must be talking at cross purposes because to me an 8-way SMT
| is very little different from a 2-way. Bigger register files for
| architected state and a few more bits into the renamer. I don't know what
| you mean by linkages in this context. Linkages between what and what?

Between the register file and the execution units, and between
execution units. The point is the days when 'wiring' was cheap
are no more - at least according to every source I have heard!

We've been talking at cross purposes. I'm saying what needs to be added to a
CPU to give it SMT capabilities, extracting extra work from the otherwise
idle execution units. You seem to be saying what needs to be added to a CPU
to give it the performance of two CMP cores. I don't think that SMT should be
viewed that way, at least not for current implementations. I see it as
something that can be squeezed out for a little bit of extra silicon - the
register file and execution units wouldn't change in my view.

| Look at the performance counters, think of floating-point modes
| (in SMT, they may need to change for each operation), think of
| quiescing the other CPU (needed for single to dual thread switching),
| think of interrupts (machine check needs one logic, and underflow
| another). In ALL cases, on two CPUs, each can operate independently,
| but SMT threads can't.
|
| I don't see this at all. I'm not saying these things are trivial, I'm
| saying that most of it has to be done for a single threaded OoO CPU too.

No, they don't. Take performance counters. In an OoO CPU, you have
a single process and single core, so you accumulate the counter and,
at context switch, update the process state. With SMT, you have
multiple processes and multiple cores - where does the time taken
(or events occurring) in an execution unit get assigned to? The
Pentium 4 kludges this horribly.

Here, you would need two (or whatever) copies of the performance counters.

Consider mode switching. In an OoO CPU, a typical mode switch is
a synchronisation point, and is reset on a context switch. With
SMT, a mode must be per-thread (which was said by hardware people
to be impossible a decade ago).

But that is exactly what the Pentium 4 does. Each virtual CPU does its own
thing. You use the word thread here, which I don't use in this context. It is
utterly trivial for an OoO design.

Consider interrupt handling. Underflow etc. had better be handled
within its thread, because the other might be non-interruptible
(and think scalability). But you had BETTER not handle all machine
checks like that (such as ones that disable an execution unit, in
a high-RAS design), as the execution units are in common.

Interrupts and exceptions are handled in exactly the same way as two cores
or chips would do. Exceptions are taken by the virtual processor that
triggered them. Hardware interrupts are taken by whichever (virtual) processor
they are assigned to by the interrupt controller (APIC in PC style designs).

Machine checks are taken on each virtual CPU as and when that CPU detects
them. If the machine state is architecturally visible then it is duplicated.

Consider quiescing the other CPU to switch between single and dual
thread mode, to handle a machine check or whatever. You had BETTER
ensure that both CPUs don't do it at once ....

I don't understand this. If you are switching from single virtual CPU to two
how can the second be doing anything until it exists?

Regards,
Nick Maclaren.

Peter

#**252** October 5th 04, 01:41 PM

In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
pointing out that latency tolerance is the best use for SMT.

This is correct. I don't see SMT as a quick route to
maximum performance, but rather a cheap route to somewhat
better performance. SMT could be done in combination with
SMP or even CMP. Do something during the stalls.

Sorry for jumping in too hastily,

Britain and America. Divided by a common language

-- Robert

#**253** October 5th 04, 01:52 PM

In comp.sys.ibm.pc.hardware.chips Joe Seigh wrote:
What about IDE? That seems to be rather cpu intensive
when you're doing a lot of i/o.

Not anymore. Modern IDE chipsets do bus-master DMA and have
fairly low CPU overhead. AFAIK, IDE still doesn't have a
multicommand bus, which SCSI has always had (no need to
wait for seeks to complete). So SCSI is preferable for more
than one intense device per bus.

-- Robert

#**254** October 5th 04, 02:13 PM

Robert Redelmeier wrote:

AFAIK, IDE still doesn't have a
multicommand bus which SCSI has always had (don't need to
wait for seeks to complete).

SATA-II has native command queuing, NCQ. AFAIK, you could use SCSI command
queuing on SATA if the device understood it, since SATA (like ATAPI)
allows SCSI commands to be transferred, and therefore allows SCSI's command
queuing mechanism ("tagged command queuing", TCQ) to be used.

--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/

#**255** October 5th 04, 06:58 PM

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

In article ,
keith wrote:
When I looked (a few months ago) a decent dual AthlonMP board was around
$400, with the processors at rather a premium too. I was *considering* a
dual K7 at the time, rather than a single K8. The duals lost because of
the cost. It would have been cheaper to upgrade the second system than go
SMP.

I've been running a Tyan S2466N-4M with a pair of Athlon MP 2100s at home
now for a couple of years or so (have been thinking of a pair of faster
processors as a cheap upgrade lately). Pricewatch puts this board at about
$190, with tray processors at about $90 each. IIRC, the board didn't cost
much more back when I bought it than it does now. (The processors were a
fair bit more expensive, but they were only one or two steps down from the
fastest-available speed at the time.)

_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( http://alfter.us/ Top-posting!
\_^_/ rm -rf /bin/laden What's the most annoying thing on Usenet?

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (Linux)

iD8DBQFBYuC2VgTKos01OwkRAsInAJ0ccEEa2z7XCkn/K2ZON/v7U+uEBACg48v/
O8puH7rQbb9dQLGvyE7iw6o=
=6IBu
-----END PGP SIGNATURE-----

#**256** October 5th 04, 07:57 PM

In comp.arch Robert Redelmeier wrote:
In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
On Mon, 4 Oct 2004, Robert Redelmeier wrote:
Code type matters. SMT is best for continuing work during
the ~300 clock memory fetch latency.

What is the evidence to back up this claim?

Logic. When else can SMT really do net increased work?

Results derived from logic that are not backed up with
evidence usually turn out to be wrong.

If you want to test, run some pointer-chasers.

So why do you claim logic and don't post your own results?
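
For readers unfamiliar with the term, a pointer-chaser is a loop in which every load address depends on the result of the previous load, so memory latency cannot be overlapped. The following is only a minimal sketch of the idea; the names `make_chain` and `chase` and the chosen size are illustrative assumptions, not anything posted in this thread. In CPython the timing is dominated by interpreter overhead, so a real latency test would port this to C and use a working set far larger than the cache.

```python
import random
import time

def make_chain(n: int, seed: int = 0) -> list[int]:
    """Random single-cycle permutation (Sattolo's algorithm): nxt[i] is the slot visited after i."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)          # j < i guarantees one cycle covering all n slots
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt

def chase(nxt: list[int], steps: int, start: int = 0) -> int:
    """Follow the chain; each load depends on the previous one, so loads cannot overlap."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i

if __name__ == "__main__":
    n = 1 << 18                        # assumed size, kept small so the sketch runs quickly
    chain = make_chain(n)
    t0 = time.perf_counter()
    end = chase(chain, n)
    dt = time.perf_counter() - t0
    print(f"{n} dependent loads in {dt:.3f}s, ended at slot {end}")
```

Running two such chases on the two logical CPUs of an SMT core, and comparing against one chase alone, is the kind of measurement being asked for here.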

[snip]

CMP will also help the former [bandwidth].

Nope, not without a second memory bus and all those pins.

Provided they share at least one level of cache, what you
just said is completely false.

-- Robert

--
Sander

+++ Out of cheese error +++

#**257** October 5th 04, 07:59 PM

In comp.arch Felger Carbon wrote:
"Kees van Reeuwijk" wrote in message
...
Felger Carbon wrote:

We have long had desktop SMP available. Question: what legacy
software runs faster on two cores (whether on one or two chips) than
on one? Answer: none.

Photoshop, and probably most other professional audio/video/graphics
programs, especially for Apple.

Bzzt! This is an IBM.PC NG.

comp.arch is not an ibm.pc ng, so bzzt yourself, please.

Oh, and make. I bought an SMP Linux computer years ago for the specific
purpose of running my compiler testsuite in half the time.
Parallelization was done with make.

"Make" is run on workstations. It is not a legacy application for
personal computers.

You are wrong here again - make even used to come with DOS, and many
legacy copies from various development packages for DOS abound.

--
Sander

+++ Out of cheese error +++

#**258** October 5th 04, 10:19 PM

"Stephen Sprunk" wrote in message
news:1096991350./4f0QCaPR7ArQ+mUkxYEsQ@teranews...
"Stephen Fuld" wrote in message
...
But if you compare different cores, the more complex ones for the SMT
(excluding the extra complexity of the SMT) versus a simpler one for the
CMP, then you complicate the comparison by not comparing apples to
apples. How much of the difference is the SMT vs CMP and how much is the
difference in cores? One presumes the more complex core performs better
than the simpler one (or why do the complex one). Besides, if the SMT
die area penalty is in the 10% range that many have been quoting, can you
do the "simpler" core in almost exactly 55% of the die area of the
complex one? Once you change the core, you change the comparison such
that I maintain that it isn't the same comparison any more and my
original comment holds.

IBM claimed 25% die size increase to add SMT to Power5, and a -5% to +24%
performance gain, i.e. the performance increase _never_ matches up to the
size increase.

I don't remember those figures from IBM. I am not doubting you at all -
just my memory. But a 25% die area penalty does seem high to me based on what
others have said.

Also, it isn't clear that there is some version of the Power core that takes 63%
of the size of the non-SMT Power one with which one could do Nick's
comparison. I certainly agree that if you have twice the silicon space then
CMP may outperform SMT, but if you can only "afford" less than a 100%
penalty, then SMT may make sense.

Traditional SMP costs 100% more in die size and, in the case of Opteron,
and never exceeds 100% performance gain except in corner cases. CMP will
require slightly less die (and significantly less in system costs) but
will likely be offset by memory contention of having half the number of
memory channels.

Yes - agreed.

Looks like apples and apples so far.

But Nick specified equal die area for his comparison.
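
The 55% and 63% figures above follow from the equal-die-area constraint: if one SMT core costs (1 + p) times the area of the plain core, two CMP cores on the same area each get (1 + p) / 2. A minimal sketch of that arithmetic, assuming only the 10% and 25% penalties quoted in this thread (the function name is illustrative):

```python
def cmp_core_area_fraction(smt_penalty: float, cores: int = 2) -> float:
    """Fraction of the non-SMT core's area available to each CMP core
    when the whole CMP die must match one SMT core of area (1 + smt_penalty)."""
    return (1.0 + smt_penalty) / cores

for penalty in (0.10, 0.25):
    share = cmp_core_area_fraction(penalty)
    print(f"SMT penalty {penalty:.0%}: each CMP core gets {share:.1%} of the plain core's area")
```

This gives 55.0% for a 10% penalty and 62.5% (the "63%" above, rounded) for a 25% penalty.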

--
- Stephen Fuld
e-mail address disguised to prevent spam

S

--
Stephen Sprunk "God does not play dice." --Albert Einstein
CCIE #3723 "God is an inveterate gambler, and He throws the
K5SSS dice at every possible opportunity." --Stephen Hawking

#**259** October 6th 04, 04:16 AM

"Stephen Sprunk" wrote in message news:1096991350./4f0QCaPR7ArQ+mUkxYEsQ@teranews...
"Stephen Fuld" wrote in message
...
But if you compare different cores, the more complex ones for the SMT
(excluding the extra complexity of the SMT) versus a simpler one for the
CMP, then you complicate the comparison by not comparing apples to apples.
How much of the difference is the SMT vs CMP and how much is the
difference in cores? One presumes the more complex core performs better
than the simpler one (or why do the complex one). Besides, if the SMT die
area penalty is in the 10% range that many have been quoting, can you do
the "simpler" core in almost exactly 55% of the die area of the complex
one? Once you change the core, you change the comparison such that I
maintain that it isn't the same comparison any more and my original
comment holds.

IBM claimed 25% die size increase to add SMT to Power5, and a -5% to +24%
performance gain, i.e. the performance increase _never_ matches up to the
size increase.

Where did you get those figures from? Either your memory is playing
tricks on you or IBM were very conservative with their estimates. If
IBM originally claimed those numbers, they are definitely much below
what they are getting in benchmarks.

Check: http://www.redbooks.ibm.com/redpieces/pdfs/sg245768.pdf
p128. Increase of 10-50% with SMT on some industry standard
benchmarks, with an average of around 30-40%.
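
One way to put the competing claims on the same scale is throughput per unit die area relative to the non-SMT core. This is only a sketch using the figures quoted in this thread; pairing each performance gain with each area penalty is an assumption, since the redbook numbers here come without an area figure.

```python
def perf_per_area(perf_gain: float, area_increase: float) -> float:
    """Throughput per unit die area relative to the non-SMT core (1.0 = break-even)."""
    return (1.0 + perf_gain) / (1.0 + area_increase)

# Figures quoted in the thread; combining them freely is an assumption.
for area in (0.10, 0.25):
    for gain in (-0.05, 0.24, 0.30, 0.50):
        ratio = perf_per_area(gain, area)
        print(f"area +{area:.0%}, perf {gain:+.0%}: {ratio:.2f}x per unit area")
```

With a 25% area penalty, any gain below 25% falls under break-even, while the 30-40% averages cited here would sit above it.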

Traditional SMP costs 100% more in die size and, in the case of Opteron, and
never exceeds 100% performance gain except in corner cases. CMP will
require slightly less die (and significantly less in system costs) but will
likely be offset by memory contention of having half the number of memory
channels.

Looks like apples and apples so far.

S

#**260** October 6th 04, 11:18 AM

"Felger Carbon" writes:

We have long had desktop SMP available. Question: what legacy
software runs faster on two cores (whether on one or two chips) than
on one? Answer: none.

Anything run with X*, i.e. almost everything in existence.

Old tests I did long ago showed a 30% increase was the minimum you
should expect.

* `with X' includes xterms and the like.

--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be.
