#221
In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
| On Mon, 4 Oct 2004, Robert Redelmeier wrote:
| | Code type matters. SMT is best for continuing work during
| | the ~300 clock memory fetch latency.
|
| What is the evidence to back up this claim?

Logic. When else can SMT really do net increased work? If you want to test, run some pointer-chasers.

| I would however claim that functional units are almost free,

This is getting more and more true as caches grow, but only from an areal perspective. A multiplier still sucks back a huge amount of power and tosses it as heat.

| CMP will also help the former [bandwidth].

Nope, not without a second memory bus and all those pins.

-- Robert
#222
In article k, Peter Boyle writes:
| On Mon, 4 Oct 2004, Robert Redelmeier wrote:
|
| | Code type matters. SMT is best for continuing work during
| | the ~300 clock memory fetch latency.
|
| What is the evidence to back up this claim?
|
| Not theories, but _evidence_ of bigger speed up compared to,
| for example, switch on event multi-threading, or CMP with simpler
| and smaller processors, but not sharing L1 cache.
|
| Note that I'm not claiming evidence the other way, but as far as
| I can tell the jury is out on the best organisation for concurrency
| on chip.

I should be happy to see even a theoretical analysis - I wasn't impressed by Eggers's omission of a comparable CMP for comparison purposes.

Regards, Nick Maclaren.
#223
In article , "Peter Dickerson" writes:
|
| I think that Nick is muddled on this one. If the base implementation is
| already OoO then there will normally be many more physical registers than
| architected ones. To go two-way SMT may not involve adding any physical
| registers, but rather involve changes to renaming. "Dual porting" every
| execution unit doesn't make much sense to me. Access to execution units
| from either virtual processor is essentially free - they are after all
| virtual processors, not real. What is required is that every bit of
| *architected* processor state be renamed or duplicated; perhaps that's
| what Nick is getting at?

You haven't allowed for the problem of access. A simple (CMP) duplication doesn't increase the connectivity, and can be done more-or-less by replicating a single core; SMT does, and may need the linkages redesigning. This might be a fairly simple task for 2-way SMT, though there have been reports that it isn't even for that, but consider it for 8-way.

| | You have to mangle any performance counters and many privileged
| | registers fairly horribly, because their meanings and constraints
| | change. Similarly, you have to add logic for CPU state change
| | synchronisation, because some changes must affect only the current
| | thread and some must affect both. And you have to handle the case
| | of the two threads attempting incompatible operations simultaneously.
|
| What operations are incompatible? SMT as implemented in the Pentium 4,
| say, allows either virtual processor to do what it likes. One can
| transition from user to kernel and back while the other services
| interrupts or exceptions or whatever. The only coordination needed for
| proper operation is what is needed for two processors - of course the
| performance may suffer though.

ABSOLUTELY NOT! Look at the performance counters, think of floating-point modes (in SMT, they may need to change for each operation), think of quiescing the other CPU (needed for single to dual thread switching), think of interrupts (machine check needs one logic, and underflow another). In ALL cases, on two CPUs, each can operate independently, but SMT threads can't.

| Yes, lots of speculation. The difference here is that a CMP processor
| takes about twice the silicon of one core, while with SMT you have the
| option to use 1.5 cores worth of silicon. Perhaps once dual cores are
| cheap and easy SMT will die because it's more effort than it's worth,
| but my bet is that chips will go both routes with SMT and CMP. Just one
| more little problem for the OS developers to deal with.

I am referring to the fair comparison between a 2-way SMT and a dual-core CMP using the same amount of silicon, power etc. THAT is what should have been compared - but I can find no evidence that it was (though it probably was).

Regards, Nick Maclaren.
#224
"Nick Maclaren" wrote in message ...
| In article , "Peter Dickerson" writes:
| | I think that Nick is muddled on this one. If the base implementation is
| | already OoO then there will normally be many more physical registers
| | than architected ones. To go two-way SMT may not involve adding any
| | physical registers, but rather involve changes to renaming. "Dual
| | porting" every execution unit doesn't make much sense to me. Access to
| | execution units from either virtual processor is essentially free -
| | they are after all virtual processors, not real. What is required is
| | that every bit of *architected* processor state be renamed or
| | duplicated; perhaps that's what Nick is getting at?
|
| You haven't allowed for the problem of access. A simple (CMP) duplication
| doesn't increase the connectivity, and can be done more-or-less by
| replicating a single core; SMT does, and may need the linkages
| redesigning. This might be a fairly simple task for 2-way SMT, though
| there have been reports that it isn't even for that, but consider it for
| 8-way.

I think we must be talking at cross purposes, because to me an 8-way SMT is very little different from a 2-way. Bigger register files for architected state and a few more bits into the renamer. I don't know what you mean by linkages in this context. Linkages between what and what?

| | | You have to mangle any performance counters and many privileged
| | | registers fairly horribly, because their meanings and constraints
| | | change. Similarly, you have to add logic for CPU state change
| | | synchronisation, because some changes must affect only the current
| | | thread and some must affect both. And you have to handle the case
| | | of the two threads attempting incompatible operations simultaneously.
| |
| | What operations are incompatible? SMT as implemented in the Pentium 4,
| | say, allows either virtual processor to do what it likes. One can
| | transition from user to kernel and back while the other services
| | interrupts or exceptions or whatever. The only coordination needed for
| | proper operation is what is needed for two processors - of course the
| | performance may suffer though.
|
| ABSOLUTELY NOT! Look at the performance counters, think of floating-point
| modes (in SMT, they may need to change for each operation), think of
| quiescing the other CPU (needed for single to dual thread switching),
| think of interrupts (machine check needs one logic, and underflow
| another). In ALL cases, on two CPUs, each can operate independently, but
| SMT threads can't.

I don't see this at all. I'm not saying these things are trivial, I'm saying that most of it has to be done for a single-threaded OoO CPU too.

| | Yes, lots of speculation. The difference here is that a CMP processor
| | takes about twice the silicon of one core, while with SMT you have the
| | option to use 1.5 cores worth of silicon. Perhaps once dual cores are
| | cheap and easy SMT will die because it's more effort than it's worth,
| | but my bet is that chips will go both routes with SMT and CMP. Just one
| | more little problem for the OS developers to deal with.
|
| I am referring to the fair comparison between a 2-way SMT and a dual-core
| CMP using the same amount of silicon, power etc. THAT is what should have
| been compared - but I can find no evidence that it was (though it
| probably was).
|
| Regards, Nick Maclaren.

While I'm looking at the cost of making a single-threaded OoO CPU into a multithreaded one. That probably explains much of the disparity above. If I had enough silicon for two OoO CPUs I'd probably take the extra hit (5%-30%, or whatever) to add SMT to each core. If the game is how to get the max performance (by some measure) from a given area of silicon then I'd have to know how big it is - if it's just too small for two separate cores...

Peter
#225
| | Yes, lots of speculation. The difference here is that a CMP processor
| | takes about twice the silicon of one core, while with SMT you have the
| | option to use 1.5 cores worth of silicon. Perhaps once dual cores are
| | cheap and easy SMT will die because it's more effort than it's worth,
| | but my bet is that chips will go both routes with SMT and CMP. Just one
| | more little problem for the OS developers to deal with.
|
| I am referring to the fair comparison between a 2-way SMT and a dual-core
| CMP using the same amount of silicon, power etc. THAT is what should have
| been compared - but I can find no evidence that it was (though it
| probably was).
|
| Regards, Nick Maclaren.

And for those who don't know, I'll make a more detailed explanation. Most papers on the matter seem to speak about the same number of execution units, BUT in reality a large part of the area has N^2 areal complexity, so in REALITY you could get 1.4 or 1.33 times the number of execution resources (EXCEPT cache), with a 6% SMT overhead, for the same area, for TWO processors compared to a SINGLE one. And doubling cache typically gives a 10% increase overall. So that's not a disadvantage for the CMP version that DOESN'T share caches; on the contrary, it reduces cache conflicts. And those 1.3 times execution resources don't mean 1.3 times single-threaded performance, as MOST of the time you could use only a small portion of the resources for a single thread. So a 2-core CMP vs SMT, for instance, would be 6-way vs 8-way for SMT, for similar area... And having separate caches wouldn't hurt too much. Especially if one CPU could have quick access to the 2nd CPU's L2, with its OWN set of L2 tags for it, without updating the other CPU's L2 LRU state, and a shared victim cache... Now you have about twice the cache bandwidth, and less cache latency, and can avoid strange cache conflicts between threads. Besides, a shared L2 I$ helps the I$ hit rate...

Yes, there needs to be balancing between having more cores and more powerful cores, but current papers on the matter penalize CMP because they don't take into account any kind of design trade-offs that make the CMP machine have MORE execution resources in total, and less cache contention and lower latencies on a hit.

Jouni Osmala
#226
In article , "Peter Dickerson" writes:
| | You haven't allowed for the problem of access. A simple (CMP)
| | duplication doesn't increase the connectivity, and can be done
| | more-or-less by replicating a single core; SMT does, and may need
| | the linkages redesigning. This might be a fairly simple task for
| | 2-way SMT, though there have been reports that it isn't even for
| | that, but consider it for 8-way.
|
| I think we must be talking at cross purposes because to me an 8-way SMT
| is very little different from a 2-way. Bigger register files for
| architected state and a few more bits into the renamer. I don't know
| what you mean by linkages in this context. Linkages between what and what?

Between the register file and the execution units, and between execution units. The point is that the days when 'wiring' was cheap are no more - at least according to every source I have heard!

| | Look at the performance counters, think of floating-point modes
| | (in SMT, they may need to change for each operation), think of
| | quiescing the other CPU (needed for single to dual thread switching),
| | think of interrupts (machine check needs one logic, and underflow
| | another). In ALL cases, on two CPUs, each can operate independently,
| | but SMT threads can't.
|
| I don't see this at all. I'm not saying these things are trivial, I'm
| saying that most of it has to be done for a single threaded OoO CPU too.

No, they don't. Take performance counters. In an OoO CPU, you have a single process and single core, so you accumulate the counter and, at context switch, update the process state. With SMT, you have multiple processes and multiple cores - where does the time taken (or events occurring) in an execution unit get assigned to? The Pentium 4 kludges this horribly.

Consider mode switching. In an OoO CPU, a typical mode switch is a synchronisation point, and is reset on a context switch. With SMT, a mode must be per-thread (which was said by hardware people to be impossible a decade ago).

Consider interrupt handling. Underflow etc. had better be handled within its thread, because the other might be non-interruptible (and think scalability). But you had BETTER not handle all machine checks like that (such as ones that disable an execution unit, in a high-RAS design), as the execution units are in common.

Consider quiescing the other CPU to switch between single and dual thread mode, to handle a machine check or whatever. You had BETTER ensure that both CPUs don't do it at once....

Regards, Nick Maclaren.
#227
"Robert Redelmeier" wrote in message ...
| In comp.sys.ibm.pc.hardware.chips Peter Boyle wrote:
| | On Mon, 4 Oct 2004, Robert Redelmeier wrote:
| | | Code type matters. SMT is best for continuing work during
| | | the ~300 clock memory fetch latency.
| |
| | What is the evidence to back up this claim?
|
| Logic. When else can SMT really do net increased work? If you
| want to test, run some pointer-chasers.
|
| | I would however claim that functional units are almost free,
|
| This is getting more and more true as caches grow, but only from
| an areal perspective. A multiplier still sucks back a huge amount
| of power and tosses it as heat.

Which is an argument for SMT over CMP. With SMT, you can "share" one multiplier between the two threads (assuming they are not both heavy users of multiply - which is true for lots of server type workloads), whereas a CMP would require two multipliers with all the power and heat issues that implies.

--
- Stephen Fuld
e-mail address disguised to prevent spam
#228
"Nick Maclaren" wrote in message ...

snip

| I am referring to the fair comparison between a 2-way SMT and a dual-core
| CMP using the same amount of silicon, power etc. THAT is what should have
| been compared - but I can find no evidence that it was (though it
| probably was).

Probably because it can't be done. I think virtually everyone here believes that the extra silicon area for a two way SMP is much less than 100% of the die area of the core. Thus a two way SMP will use less die area, power, etc. than a two way CMP and the comparison that you specify can't be done.

Let me repeat, I am not an SMP bigot. It seems to me that it is a useful tool, along with others, including CMP, in the designer's tool box. As someone else has said, I expect the future to be combinations of both, along with multiple chips per PCB and multiple PCBs per system.

--
- Stephen Fuld
e-mail address disguised to prevent spam
#230
In article , Stephen Fuld wrote:
| "Nick Maclaren" wrote in message ...
| | I am referring to the fair comparison between a 2-way SMT and a
| | dual-core CMP using the same amount of silicon, power etc. THAT is what
| | should have been compared - but I can find no evidence that it was
| | (though it probably was).
|
| Probably because it can't be done. I think virtually everyone here
| believes that the extra silicon area for a two way SMP is much less than
| 100% of the die area of the core. Thus a two way SMP will use less die
| area, power, etc. than a two way CMP and the comparison that you specify
| can't be done.
|
| Let me repeat, I am not an SMP bigot. It seems to me that it is a useful
| tool, along with others, including CMP, in the designer's tool box. As
| someone else has said, I expect the future to be combinations of both,
| along with multiple chips per PCB and multiple PCBs per system.

In the above, you mean SMT, I assume.

It's been possible for at least 5 years, probably 10. Yes, the cores of a CMP system would necessarily be simpler, but it becomes possible as soon as the transistor count of the latest and greatest model in the range exceeds double that of the simplest. Well, roughly, and allowing for the difference between code and data transistors.

Regards, Nick Maclaren.