#31
AMD to leave x86 behind?
Tim McCaffrey wrote:
> Instructions that load/store/copy memory and that control how much cache pollution is done, and/or communicate to the bridges and I/O devices how much data is being loaded/stored, could improve efficiency on the I/O (PCI) and memory buses.

Ahhh... Maybe you should look at the MTRRs (Memory Type Range Registers), the PAT (Page Attribute Table), and the prefetch and non-temporal instructions already provided. I've seen near-theoretical throughput numbers both on the memory subsystem and on PCI buses, given properly tuned code. It's not that you can't control such things with the x86; it's just that I haven't seen a compiler generate optimal code.

AMD had a very nice document they wrote a few years ago about how to get max throughput with memory copy operations, where they compared different methods and instructions for doing the memory copy. If I remember correctly, in the end they got nearly theoretical bandwidth numbers by doing a simple loop to pre-read (with actual register loads instead of prefetch) cache-block-sized reads, followed by another loop actually doing a non-temporal quadword copy. This reduced the read-vs-write bus turnaround times enough to get numbers that were significantly faster than nearly any other method.

So, it's possible, right now, given proper code.
#32
AMD to leave x86 behind?
Jeremy Linton wrote:
> AMD had a very nice document they wrote a few years ago about how to get max throughput with memory copy operations, where they compared different methods and instructions for doing the memory copy. If I remember correctly, in the end they got nearly theoretical bandwidth numbers by doing a simple loop to pre-read (with actual register loads instead of prefetch) cache-block-sized reads, followed by another loop actually doing a non-temporal quadword copy. This reduced the read-vs-write bus turnaround times enough to get numbers that were significantly faster than nearly any other method.

Afair, that optimization was in regard to doing a simple set of FP operations on a block of data, where it turned out that the fastest way was to move everything three times: first the max-speed pre-read loop, then an operate loop storing to a fixed buffer half the size of L1, then finally NT (non-temporal) stores to move the result block to the final destination.

> So, it's possible, right now, given proper code.

Or in this case, quite horribly overcomplicated code. :-(

Terje
--
- "almost all programming can be viewed as an exercise in caching"
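The loop structure Terje describes can be sketched as follows. This is only an illustration of the three passes, not the code from the AMD document: Python has no prefetch or non-temporal store instructions, so it shows the blocking order and nothing about the speedup. The names `process_blocked`, `LINE_ELEMS` and `block_elems` are made up for this sketch, and the 64-byte cache line is an assumption.

```python
from array import array

LINE_ELEMS = 8  # assumption: 64-byte cache line holding 8-byte doubles


def process_blocked(src, dst, op, block_elems):
    """Apply op to each element of src into dst, one block at a time."""
    scratch = array("d", [0.0]) * block_elems  # fixed buffer, reused per block
    for start in range(0, len(src), block_elems):
        end = min(start + block_elems, len(src))
        n = end - start

        # Pass 1: pre-read loop. Touch one element per (assumed) cache line
        # with a real load; in native code this pulls the block into cache.
        touched = 0.0
        for i in range(start, end, LINE_ELEMS):
            touched += src[i]

        # Pass 2: operate loop. Results go to the small fixed buffer.
        for i in range(n):
            scratch[i] = op(src[start + i])

        # Pass 3: move the finished block to its final destination.
        # Native code would use non-temporal stores here; Python has none.
        dst[start:end] = scratch[:n]


src = array("d", [float(i) for i in range(20)])
dst = array("d", [0.0]) * len(src)
process_blocked(src, dst, lambda x: 2.0 * x + 1.0, block_elems=8)
print(list(dst[:4]))  # [1.0, 3.0, 5.0, 7.0]
```

In a native version, `block_elems` would be chosen so the scratch buffer fills about half of L1, as described above.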
#33
AMD to leave x86 behind?
On Mon, 31 Oct 2005 18:04:01 GMT, Jeremy Linton writes:
> Ahhh... Maybe you should look at the MTRRs (Memory Type Range Registers), the PAT (Page Attribute Table), and the prefetch and non-temporal instructions already provided. I've seen near-theoretical throughput numbers both on the memory subsystem and on PCI buses, given properly tuned code. It's not that you can't control such things with the x86; it's just that I haven't seen a compiler generate optimal code.
> AMD had a very nice document they wrote a few years ago about how to get max throughput with memory copy operations, where

Would you happen to know the URL? I'd like to read this document.

> This reduced the read-vs-write bus turnaround times enough to get numbers that were significantly faster than nearly any other method.

In particular, for this memory turnaround effect you've mentioned?

Scott
#34
AMD to leave x86 behind?
David Kanter wrote:
>>> How about some form of SMT for AMD?
>> I don't know; that might come too, but it can't be done as easily as Hyperthreading. Hyperthreading relied on the Pentium 4's inherent inefficiency to run a lot of threads simultaneously.
> If you think that any modern MPU is efficient, you are smoking crack. They all have plenty of unused cycles left on the table (except when running Linpack).

But the secret is to have enough idle cycles to run both threads at close to full speed each. I'd say anything that had enough to run both threads at 80% of full speed was a reasonably successful SMT.
#35
AMD to leave x86 behind?
Stephen Fuld wrote:
> Is there some technical reason behind the limitation to three HT links, or was it a marketing decision? If the latter, then it doesn't seem like it would be a big deal, if larger systems seem to be a bigger market, to add another link (or even two). The HT links must be a pretty small amount of silicon and a small number of pins. Does that make sense?

I don't think there was any technical or marketing reason behind limiting it to 3 HTT links per processor. It may have simply been a "we need to keep the number of HTT links and their pin counts within a reasonable amount"-type decision. I'm sure they can add even more HT links in the future.

Yousuf Khan
#36
AMD to leave x86 behind?
Oliver S. wrote:
>> If it added instructions to explicitly prefetch data from another processor, then it would probably have a gain in performance.
> These instructions wouldn't work better than the prefetching instructions currently implemented. I think it would be cleverer to copy HW-scouting from Sun's upcoming CPUs. HW-scouting is simple to implement if you're going to have an SMT core anyway.

So what's HW-scouting?

Yousuf Khan
#37
AMD to leave x86 behind?
David Hopwood wrote:
> Rob Stow wrote:
>>> In an eight-way system most are one hop away, while a few are two hops away.
>> No again. This would be the ideal 8P Opty 8xx scheme:
>>
>> CPU6-----------------CPU7
>>  | \                 / |
>>  |  \               /  |
>>  |   CPU4------CPU5   |
>>  |    |          |    |
>>  |    |          |    |
>>  |   CPU2------CPU3   |
>>  |  /               \  |
>>  | /                 \ |
>> CPU0                 CPU1
>>  |                    |
>>  |                    |
>> Chipset           Chipset
>>
>> Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.
> That's not optimal:
>
> CPU6--------------CPU7
>  | \_____   _____/ |
>  |       \ /       |
>  |        X        |
>  |       / \       |
>  |   CPU4---CPU5   |
>  |    |       |    |
>  |    |       |    |
>  |   CPU2---CPU3   |
>  |  /           \  |
>  | /             \ |
> CPU0              CPU1
>  |                 |
>  |                 |
> Chipset        Chipset
>
> 11 one-hops, 16 two-hops, and 1 three-hop.

I don't get it; your diagram seems to be only a different permutation of Rob's diagram. The only difference in yours is that you've got CPU5 connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to CPU6 and CPU5 to CPU7. That little "X" you put in between doesn't represent a shortcut; it represents one line going over the other but not touching.

Listing all of the 3-hop combinations in yours and Rob's, this is what I get.

Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6

David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7

Only #4 & #5 are different between your two respective diagrams.

Yousuf Khan
#38
AMD to leave x86 behind?
Oliver S. wrote:
>> And let's say it'll have 32 FP registers instead of just 16 like SSE does.
> Of course 32 registers would be better than 16, but I think we're well past a critical point with 16 FP registers. I think these large register sets we see today on newer architectures exist because they are easy to implement in a CPU rather than because they are necessary; in other words, the benefit of 32 or more registers isn't very high in most cases, but their cost in terms of the chip design is rather low, as long as the register file doesn't become too large.

Can't disagree with that.

Yousuf Khan
#39
AMD to leave x86 behind?
Yousuf Khan wrote:
> David Hopwood wrote:
>> Rob Stow wrote:
>>>> In an eight-way system most are one hop away, while a few are two hops away.
>>> No again. This would be the ideal 8P Opty 8xx scheme:
>>>
>>> CPU6-----------------CPU7
>>>  | \                 / |
>>>  |  \               /  |
>>>  |   CPU4------CPU5   |
>>>  |    |          |    |
>>>  |    |          |    |
>>>  |   CPU2------CPU3   |
>>>  |  /               \  |
>>>  | /                 \ |
>>> CPU0                 CPU1
>>>  |                    |
>>>  |                    |
>>> Chipset           Chipset
>>>
>>> Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.
>> That's not optimal:
>>
>> CPU6--------------CPU7
>>  | \_____   _____/ |
>>  |       \ /       |
>>  |        X        |
>>  |       / \       |
>>  |   CPU4---CPU5   |
>>  |    |       |    |
>>  |    |       |    |
>>  |   CPU2---CPU3   |
>>  |  /           \  |
>>  | /             \ |
>> CPU0              CPU1
>>  |                 |
>>  |                 |
>> Chipset        Chipset
>>
>> 11 one-hops, 16 two-hops, and 1 three-hop.
> I don't get it; your diagram seems to be only a different permutation of Rob's diagram. The only difference in yours is that you've got CPU5 connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to CPU6 and CPU5 to CPU7. That little "X" you put in between doesn't represent a shortcut; it represents one line going over the other but not touching.
>
> Listing all of the 3-hop combinations in yours and Rob's, this is what I get.
>
> Rob:
> 1: CPU0-CPU1: 0-2-3-1
> 2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
> 3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
> 4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
> 5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
>
> David:
> 1: CPU0-CPU1: 0-2-3-1
> 2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
> 3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
> 4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
> 5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
>
> Only #4 & #5 are different between your two respective diagrams.
>
> Yousuf Khan

The cross-over gives shortcuts for #2 to #5 (#1, CPU0-CPU1, is still a 3-hop):

2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7

Regards,
David
#40
AMD to leave x86 behind?
Yousuf Khan wrote:
> David Hopwood wrote:
>> That's not optimal:
>>
>> CPU6--------------CPU7
>>  | \_____   _____/ |
>>  |       \ /       |
>>  |        X        |
>>  |       / \       |
>>  |   CPU4---CPU5   |
>>  |    |       |    |
>>  |    |       |    |
>>  |   CPU2---CPU3   |
>>  |  /           \  |
>>  | /             \ |
>> CPU0              CPU1
>>  |                 |
>>  |                 |
>> Chipset        Chipset
>>
>> 11 one-hops, 16 two-hops, and 1 three-hop.
> I don't get it; your diagram seems to be only a different permutation of Rob's diagram. The only difference in yours is that you've got CPU5 connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to CPU6 and CPU5 to CPU7. That little "X" you put in between doesn't represent a shortcut; it represents one line going over the other but not touching.
>
> Listing all of the 3-hop combinations in yours and Rob's, this is what I get.
>
> Rob:
> 1: CPU0-CPU1: 0-2-3-1
> 2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
> 3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
> 4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
> 5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
>
> David:
> 1: CPU0-CPU1: 0-2-3-1
> 2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
> 3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
> 4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
> 5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
>
> Only #4 & #5 are different between your two respective diagrams.

I think you've missed a key feature of that cross:

2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7

I.e. only the CPU0-CPU1 pair has to pass over three hops.

Terje
--
- "almost all programming can be viewed as an exercise in caching"
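The hop counts quoted for the two layouts can be checked with a short breadth-first search. This sketch assumes the link lists below are a correct reading of the two diagrams, including the outer vertical lines as direct CPU6-CPU0 and CPU7-CPU1 links; the names `COMMON`, `ROB`, `DAVID` and `hop_histogram` are invented for the example, and the chipset links are left out because they do not connect CPUs.

```python
from collections import deque

# Links present in both layouts (CPU numbers as in the diagrams).
COMMON = [(0, 2), (0, 6), (1, 3), (1, 7), (2, 3),
          (2, 4), (3, 5), (4, 5), (6, 7)]
ROB = COMMON + [(4, 6), (5, 7)]    # straight: CPU6-CPU4, CPU7-CPU5
DAVID = COMMON + [(5, 6), (4, 7)]  # crossed:  CPU6-CPU5, CPU7-CPU4


def hop_histogram(links, n=8):
    """Return {hop count: number of CPU pairs} for an undirected topology."""
    adj = {cpu: set() for cpu in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)

    hist = {}
    for src in range(n):
        # Breadth-first search gives the minimum hop count from src.
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        # Count each unordered pair once.
        for dst in range(src + 1, n):
            hist[dist[dst]] = hist.get(dist[dst], 0) + 1
    return dict(sorted(hist.items()))


print(hop_histogram(ROB))    # {1: 11, 2: 12, 3: 5}
print(hop_histogram(DAVID))  # {1: 11, 2: 16, 3: 1}
```

Both layouts use 11 links and three HT links per CPU; swapping the two upper links for the crossed pair leaves CPU0-CPU1 as the only three-hop pair.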