AMD to leave x86 behind?

#31 October 31st 05, 06:04 PM

Tim McCaffrey wrote:
Instructions that load/store/copy memory that control how much
cache pollution is done and/or communicate to the bridges and
I/O devices how much data is being loaded/stored could improve
efficiency on the I/O (PCI) and memory buses.

Ahhh... Maybe you should look at the MTRRs (Memory Type Range
Registers), the PAT (Page Attribute Table), and the prefetch and
non-temporal instructions already provided. I've seen near-theoretical
throughput numbers both on the memory subsystem and on PCI buses given
properly tuned code. It's not that you can't control such things on x86;
it's just that I haven't seen a compiler generate optimal code.

AMD had a very nice document they wrote a few years ago about how to
get max throughput with memory copy operations, where they compared
different methods and instructions for doing the memory copy. If I
remember correctly, in the end they got nearly theoretical bandwidth
numbers by doing a simple loop to pre-read (with actual register loads
instead of prefetch) cache-block-sized reads, followed by another loop
actually doing a non-temporal quadword copy. This reduced the read vs.
write bus turnaround times enough to get numbers that were significantly
faster than nearly any other method.

So, it's possible right now, given proper code.
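
As a rough illustration of why grouping the reads and the writes helps, here is a toy model in Python. It is only a sketch under loud assumptions: it counts how often a stream of cache-line operations switches between read and write, and it ignores write buffers, write combining, and real DRAM timing. The function names, the 1024-line copy size, and the 64-line block size are made up for illustration and do not come from the AMD document.

```python
def count_turnarounds(ops):
    """Count how often consecutive bus operations change direction."""
    return sum(1 for prev, cur in zip(ops, ops[1:]) if prev != cur)


def interleaved_copy(n_lines):
    """Naive copy: read one cache line, then write it, for every line."""
    ops = []
    for _ in range(n_lines):
        ops += ["R", "W"]
    return ops


def blocked_copy(n_lines, block_lines):
    """Block copy: read a whole block first, then write the whole block."""
    ops = []
    for start in range(0, n_lines, block_lines):
        n = min(block_lines, n_lines - start)
        ops += ["R"] * n  # stands in for the pre-read loop
        ops += ["W"] * n  # stands in for the non-temporal store loop
    return ops


N_LINES = 1024    # hypothetical copy size, in cache lines
BLOCK_LINES = 64  # hypothetical block size, in cache lines

naive = count_turnarounds(interleaved_copy(N_LINES))
blocked = count_turnarounds(blocked_copy(N_LINES, BLOCK_LINES))
print(naive, blocked)  # 2047 31
```

In this model the number of direction changes drops roughly by the block size, which is the effect described above. Real gains depend on the memory controller and would have to be measured.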

#32 October 31st 05, 08:19 PM

Jeremy Linton wrote:
AMD had a very nice document they wrote a few years ago about how to
get max throughput with memory copy operations, where they compared
different methods and instructions for doing the memory copy. If I
remember correctly, in the end they got nearly theoretical bandwidth
numbers by doing a simple loop to pre-read (with actual register loads
instead of prefetch) cache-block-sized reads, followed by another loop
actually doing a non-temporal quadword copy. This reduced the read vs.
write bus turnaround times enough to get numbers that were significantly
faster than nearly any other method.

AFAIR, that optimization was in regard to doing a simple set of FP
operations on a block of data, where it turned out that the fastest way
was to move everything three times:

First the max-speed pre-read loop, then an operate loop storing to a
fixed, half-L1-sized buffer, then finally NT stores to move the result
block to the final destination.
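
The three passes can be sketched structurally like this. Python cannot issue prefetches or non-temporal stores, so this only shows the loop structure, not the cache behavior. `process_in_blocks`, `block_len`, and the example operation are hypothetical names; in real code `block_len` would be chosen so that the scratch buffer takes about half of L1.

```python
def process_in_blocks(src, dst, op, block_len):
    """Apply op to every element of src, writing results to dst, block by block."""
    scratch = [0.0] * block_len  # fixed buffer, reused for every block

    for start in range(0, len(src), block_len):
        # Pass 1: pre-read the block (stands in for the max-speed read loop).
        block = src[start:start + block_len]

        # Pass 2: operate, storing results into the fixed scratch buffer.
        for i, x in enumerate(block):
            scratch[i] = op(x)

        # Pass 3: move the result block to its final destination
        # (stands in for the non-temporal store loop).
        dst[start:start + len(block)] = scratch[:len(block)]


src = [float(i) for i in range(10)]
dst = [0.0] * len(src)
process_in_blocks(src, dst, lambda x: 2.0 * x + 1.0, block_len=4)
print(dst)
```

The last block is shorter than `block_len` here, which is why pass 3 copies only `len(block)` elements back.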

So, it's possible right now, given proper code.

Or in this case, quite horribly overcomplicated code. :-(

Terje

--
"almost all programming can be viewed as an exercise in caching"

#33 October 31st 05, 09:23 PM

On Mon, 31 Oct 2005 18:04:01 GMT, Jeremy Linton writes:

Ahhh... Maybe you should look at the MTRRs (Memory Type Range
Registers), the PAT (Page Attribute Table), and the prefetch and
non-temporal instructions already provided. I've seen near-
theoretical throughput numbers both on the memory subsystem and
on PCI buses given properly tuned code. It's not that you
can't control such things on x86; it's just that I haven't
seen a compiler generate optimal code.

AMD had a very nice document they wrote a few years ago about
how to get max throughput with memory copy operations, where

Would you happen to know the URL? I'd like to read this document.

This reduced the read vs write bus turnaround times enough to
get numbers that were significantly faster than nearly any
other method.

In particular, for this memory turnaround effect you've mentioned?

Scott

#34 November 1st 05, 06:27 AM

David Kanter wrote:
how about some form of SMT for AMD?

I don't know; that might come too, but it can't be done as easily as
Hyperthreading. Hyperthreading relied on the Pentium 4's inherent
inefficiency to run a lot of threads simultaneously.

If you think that any modern MPU is efficient, you are smoking crack.
They all have plenty of unused cycles left on the table (except when
running linpack).

But the secret is to have enough idle cycles to run both threads at
close to full speed each. I'd say anything that had enough to run both
threads at 80% of full speed would be a reasonably successful SMT.

#35 November 1st 05, 06:38 AM

Stephen Fuld wrote:
Is there some technical reason behind the limitation to three HT links or
was it a marketing decision? If the latter, then it doesn't seem like it
would be a big deal, if larger systems seem to be a bigger market, to add
another link (or even two). The HT links must be a pretty small amount of
silicon and a small number of pins. Does that make sense?

I don't think there was any technical or marketing reason behind
limiting it to 3 HTT links per processor. It may have simply been a "we
need to keep the number of HTT links and their pin counts within a
reasonable amount"-type decision. I'm sure they can add even more HT
links in the future.

Yousuf Khan

#36 November 1st 05, 06:39 AM

Oliver S. wrote:
If it added instructions to explicitly prefetch data from another
processor then it would probably have a gain in performance.

These instructions wouldn't work any better than the prefetch instructions
already implemented. I think it would be cleverer to copy HW-scouting
from Sun's upcoming CPUs. HW-scouting is simple to implement if you're
going to have an SMT core anyway.

So what's HW-scouting?

Yousuf Khan

#37 November 1st 05, 07:10 AM

David Hopwood wrote:
Rob Stow wrote:

In an eight-way system most
are one hop away, while a few are two hops away.

No again. This would be the ideal 8P Opty 8xx scheme:

CPU6-----------------CPU7
 | \                 / |
 |  \               /  |
 |   CPU4------CPU5    |
 |    |          |     |
 |    |          |     |
 |   CPU2------CPU3    |
 |  /               \  |
 | /                 \ |
CPU0                 CPU1
 |                     |
 |                     |
Chipset           Chipset

Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.

That's not optimal:

CPU6--------------CPU7
 | \_____   ____/   |
 |       \ /        |
 |        X         |
 |       / \        |
 |   CPU4---CPU5    |
 |    |      |      |
 |    |      |      |
 |   CPU2---CPU3    |
 |  /           \   |
 | /             \  |
CPU0              CPU1
 |                  |
 |                  |
Chipset        Chipset

11 one-hops, 16 two-hops, and 1 three-hop.
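
The counts for both layouts can be checked with a short breadth-first search. This is a sketch in Python; the link lists are one reading of the two diagrams (chipset links left out), and the names `ROB`, `DAVID`, and `hop_histogram` are just for illustration.

```python
from collections import Counter, deque

# CPU-to-CPU links shared by both diagrams.
COMMON = [(6, 7), (4, 5), (2, 3), (2, 4), (3, 5),
          (0, 2), (1, 3), (0, 6), (1, 7)]
ROB = COMMON + [(4, 6), (5, 7)]    # straight links: CPU4-CPU6, CPU5-CPU7
DAVID = COMMON + [(5, 6), (4, 7)]  # crossed links: CPU5-CPU6, CPU4-CPU7


def hop_histogram(links, n_cpus=8):
    """Return {hop count: number of CPU pairs at that distance}."""
    adj = {cpu: set() for cpu in range(n_cpus)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)

    hist = Counter()
    for src in range(n_cpus):
        # Plain BFS from src gives the minimum hop count to every CPU.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        # Count each unordered pair once.
        for dst, hops in dist.items():
            if dst > src:
                hist[hops] += 1
    return dict(sorted(hist.items()))


print(hop_histogram(ROB))    # {1: 11, 2: 12, 3: 5}
print(hop_histogram(DAVID))  # {1: 11, 2: 16, 3: 1}
```

With eight CPUs there are 28 pairs in total, so both histograms sum to 28.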

I don't get it; your diagram seems to be only a different permutation of
Rob's diagram. The only difference in yours is that you have CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 and CPU5 to CPU7. That little "X" you put in between doesn't
represent a shortcut; it represents one line going over the other but
not touching.

Listing all of the three-hop combinations in yours and Rob's, this is what I
get.

Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6

David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7

Only #4 & #5 are different between your two respective diagrams.

Yousuf Khan

#38 November 1st 05, 07:18 AM

Oliver S. wrote:
And let's say it'll have 32 FP registers instead of just 16 like SSE
does.

Of course 32 registers would be better than 16, but I think we're well
past the point of diminishing returns with 16 FP registers. I think the
large register sets we see today on newer architectures exist because
they are easy to implement in a CPU rather than because they are
necessary. In other words: the benefit of 32 or more registers isn't very
high in most cases, but their cost in terms of chip design is rather low
as long as the register file doesn't become too large.

Can't disagree with that.

Yousuf Khan

#39 November 1st 05, 07:48 AM

Yousuf Khan wrote:
David Hopwood wrote:

Rob Stow wrote:

In an eight-way system most
are one hop away, while a few are two hops away.

No again. This would be the ideal 8P Opty 8xx scheme:

CPU6-----------------CPU7
 | \                 / |
 |  \               /  |
 |   CPU4------CPU5    |
 |    |          |     |
 |    |          |     |
 |   CPU2------CPU3    |
 |  /               \  |
 | /                 \ |
CPU0                 CPU1
 |                     |
 |                     |
Chipset           Chipset

Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.

That's not optimal:

CPU6--------------CPU7
 | \_____   ____/   |
 |       \ /        |
 |        X         |
 |       / \        |
 |   CPU4---CPU5    |
 |    |      |      |
 |    |      |      |
 |   CPU2---CPU3    |
 |  /           \   |
 | /             \  |
CPU0              CPU1
 |                  |
 |                  |
Chipset        Chipset

11 one-hops, 16 two-hops, and 1 three-hop.

I don't get it; your diagram seems to be only a different permutation of
Rob's diagram. The only difference in yours is that you have CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 and CPU5 to CPU7. That little "X" you put in between doesn't
represent a shortcut; it represents one line going over the other but
not touching.

Listing all of the three-hop combinations in yours and Rob's, this is what I
get.

Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6

David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7

Only #4 & #5 are different between your two respective diagrams.

Yousuf Khan

The cross-over gives shortcuts for #2 to #5 (#1, CPU0-CPU1, is still a
three-hop path):

2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7

Regards,

David

#40 November 1st 05, 09:42 AM

Yousuf Khan wrote:

David Hopwood wrote:

That's not optimal:

CPU6--------------CPU7
 | \_____   ____/   |
 |       \ /        |
 |        X         |
 |       / \        |
 |   CPU4---CPU5    |
 |    |      |      |
 |    |      |      |
 |   CPU2---CPU3    |
 |  /           \   |
 | /             \  |
CPU0              CPU1
 |                  |
 |                  |
Chipset        Chipset

11 one-hops, 16 two-hops, and 1 three-hop.

I don't get it; your diagram seems to be only a different permutation of
Rob's diagram. The only difference in yours is that you have CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 and CPU5 to CPU7. That little "X" you put in between doesn't
represent a shortcut; it represents one line going over the other but
not touching.

Listing all of the three-hop combinations in yours and Rob's, this is what I
get.

Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6

David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7

Only #4 & #5 are different between your two respective diagrams.

I think you've missed a key feature of that cross:

2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7

I.e., only the CPU0-CPU1 path has to go over three hops.
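
For anyone who wants to trace the routes rather than take the list on trust, here is a small Python sketch that recovers a shortest path with breadth-first search. The link list is one reading of the crossed diagram (chipset links left out), and `DAVID_LINKS` and `shortest_path` are illustrative names only.

```python
from collections import deque

# CPU-to-CPU links in the crossed layout (CPU5-CPU6 and CPU4-CPU7 cross over).
DAVID_LINKS = [(6, 7), (5, 6), (4, 7), (4, 5), (2, 4), (3, 5),
               (2, 3), (0, 2), (1, 3), (0, 6), (1, 7)]


def shortest_path(links, src, dst):
    """Return one minimum-hop path from src to dst as a list of CPU numbers."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)

    # BFS, remembering which CPU each one was first reached from.
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            break
        for nxt in adj[cur]:
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)

    # Walk back from dst to src, then reverse.
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


for a, b in [(0, 5), (1, 4), (2, 6), (3, 7)]:
    print("-".join(str(cpu) for cpu in shortest_path(DAVID_LINKS, a, b)))
# 0-6-5
# 1-7-4
# 2-0-6
# 3-1-7

# CPU0 to CPU1 still needs three hops (four CPUs on the path).
print(len(shortest_path(DAVID_LINKS, 0, 1)) - 1)  # 3
```

Each of the four two-hop routes goes through one of the outer links (CPU0-CPU6 or CPU1-CPU7), which is what the cross makes useful.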

Terje
--
"almost all programming can be viewed as an exercise in caching"
