#11
On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier wrote:

>In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
>>My only point was that latency still matters. Superficial examination
>>of early results from the HP Superdome showed that Itanium is
>>apparently not very tolerant of increased latency, and HP engineers
>
>Oh, fully agreed. For some apps, latency is _everything_ (linked-lists,
>TP DBs). If the app hopscotches randomly thru RAM (SETI?) nothing else
>matters much. Modern systems have done wonders to deliver bandwidth.
>Dual channel DDR at high clocks. But has much been done to improve
>latency from ~130 ns? (old number)

Close enough.

>I thought the main idea behind on-CPU memory controllers was to reduce
>this to ~70 ns by reduced buffering/queuing.

And it has.

>A smart hub might be able to detect patterns like 2-4-6-8, 4-8-16-20-24
>or 5-4-3-2-1 but cannot possibly do anything with data-driven
>pseudo-randoms except add latency.

As David Wang has shrewdly observed, you lose a lot of information once you are outside the processor. All you have left is the history of memory requests.

How about a Bayesian network to try to infer the underlying pattern? Lame joke. Doesn't even warrant a smiley.

>>Itanium currently retires instructions in order. Sooner or later,
>>Intel has to do something for Itanium other than to increase the
>>cache size.
>
>Are you suggesting Out-of-Order retirement??? Intriguing possibility
>with a new arch.

Just another example of what another poster in another group would call my non-standard use of language. I had already started using the word "retirement" and stuck with it for no better reason than that I had already started using it. I'm not making any bold new proposals for computer architecture. Just at the moment, my brain is frazzled from trying to consume an entire branch of mathematics in a very short time, so I wouldn't recognize a good new idea if I saw one.
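[The hub-side stride detection Redelmeier describes is easy to sketch. This is a hypothetical illustration -- the class name and confirmation threshold are mine, not any real chipset's. It locks onto constant-stride streams like 2-4-6-8 or 5-4-3-2-1, but a data-driven pseudo-random stream never confirms a stride twice in a row, so it predicts nothing and stays out of the way:]

```python
class StrideDetector:
    """Toy model of a memory-hub stride prefetcher (illustrative only)."""

    def __init__(self):
        self.last_addr = None      # previous request seen on the bus
        self.last_stride = None    # previous inter-request delta
        self.confirmed = 0         # consecutive times the stride repeated

    def observe(self, addr):
        """Feed one memory request; return a prefetch address or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.last_stride:
                self.confirmed += 1
            else:
                self.confirmed = 0     # pattern broken: start over
            self.last_stride = stride
            if self.confirmed >= 2:    # same stride seen three times
                prediction = addr + stride
        self.last_addr = addr
        return prediction
```

Feeding it 2, 4, 6, 8 yields a prediction of 10 on the fourth request, while a pointer-chasing (data-dependent) address stream never triggers a prefetch at all, which is exactly the failure mode being discussed.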
On the other hand, there is no reason why the only information to make it across a memory hub has to be memory requests, and there is no reason why the only thing a memory location knows about itself is that it corresponds to a particular address.

>Of course, SMT is just a different solution -- keep the CPU busy with
>other work during the ~300 clock read stalls. Good if there are
>parallel threads/tasks. Useless if not.

There is another way to use SMT, which is to execute speculative slices, and there are papers in the literature for simulated-Itania that show a dramatic improvement. Until we see the details of how SMT is implemented in Montecito, it won't be obvious whether SMT can actually be used that way in Montecito or not. If it can, it goes a long way toward making up for a lack of true run-time scheduling, since the speculative slice, whose only purpose in life is to trigger memory requests, is operating in the actual run-time environment, not one assumed by the compiler.

RM
#12
Robert Myers wrote:

>On Sun, 28 Dec 2003 19:26:26 GMT, "Yousuf Khan" wrote:
>>"Robert Myers" wrote in message ...
>>>The 6MB cache is an act of desperation on Intel's part. I don't
>>>_think_ their strategy is to keep increasing cache size. It's a
>>>losing strategy, anyway, unless you go to COMA. Itanium's in-order
>>>architecture is just too inflexible, and the problem is still cache
>>>misses.
>>
>>What's COMA?
>
>Cache-only memory architecture. The original Crays were effectively COMA because Seymour used for main memory what everybody else used for cache. That's why some three-letter agencies with no use for vector architectures bought the machines.
>
>>>Intel will, I gather, move the memory controller onto the die. Other
>>>than that, the strategy of the day (and for the foreseeable future)
>>>is to hide latency, not to address it directly.
>>
>>Yes, but AMD is also proposing something similar, and they've already
>>moved the memory controller onboard.
>
>Geez, Yousuf, not _everything_ is Intel vs. AMD. ;-) Sometimes a technical issue is just a technical issue.
>
>I cannot for the life of me get inside the head of whoever makes the technical calls at Intel, because Intel seems to want to do everything the hard way. Why, I do not know. I've wondered whether they're groping around for something they can patent -- obvious and previously tried solutions don't meet that criterion, leaving "the hard way" perhaps the preferred way from their standpoint.
>
>As it happens, Intel's bone-headed approach to computer architecture works well enough for the kinds of problems I am most interested in, which involve doing the same thing over and over again in ways that are stupefyingly predictable, where you just want to find a way to do it very fast. I've often wondered if the secret of the origins of the Itanium architecture isn't that the engineers who designed it didn't adequately take into account that most of the world isn't doing technical computing.
>
>That, and the fact that nothing works really well for the applications that matter the most, which is OLTP (on-line transaction processing).
>
>Itanium happens to interest me also as an intellectual sandbox in which I can come to grips with things that may be completely obvious to some people, but not to me. It does well enough for the problems that interest me, and over the long haul, I expect Intel's bulldozer approach to architecture and marketing to win. Those things together are why you think I am an Itanium bigot.
>
>RM

--
After being targeted with gigabytes of trash by the "SWEN" worm, I have concluded we must conceal our e-mail address. Our true address is the mirror image of what you see before the "@" symbol. It's a shame such steps are necessary. ...Charlie
#13
"Robert Myers" wrote in message ...

>>Of course, SMT is just a different solution -- keep the CPU busy with
>>other work during the ~300 clock read stalls. Good if there are
>>parallel threads/tasks. Useless if not.
>
>There is another way to use SMT, which is to execute speculative
>slices, and there are papers in the literature for simulated-Itania
>that show a dramatic improvement. Until we see the details of how SMT
>is implemented in Montecito, it won't be obvious whether SMT can
>actually be used that way in Montecito or not. If it can, it goes a
>long way toward making up for a lack of true run-time scheduling, since
>the speculative slice, whose only purpose in life is to trigger memory
>requests, is operating in the actual run-time environment, not one
>assumed by the compiler.

That's an interesting way of using SMT, but I suspect we won't see such a sophisticated use of SMT until at least 65nm, possibly 45nm. SMT in the form of the P4's Hyperthreading was done without really adding too many transistors. However, it looks like any other architecture that wants to implement SMT will need to add to its transistor count. I hear that the IBM Power5 will implement SMT, and that it has added 25% to the transistor count. That's probably more a reflection of the P4 architecture's IPC inefficiency than of the Power5's.

Yousuf Khan
#14
On Sun, 28 Dec 2003 22:21:43 GMT, CJT wrote:

>Robert Myers wrote:
>
>[snip]
>
>>I cannot for the life of me get inside the head of whoever makes the
>>technical calls at Intel, because Intel seems to want to do everything
>>the hard way. Why, I do not know. I've wondered whether they're
>>groping around for something they can patent -- obvious and previously
>>tried solutions don't meet that criterion, leaving "the hard way"
>>perhaps the preferred way from their standpoint.
>
>That is probably the correct explanation.

RM
#15
"CJT" wrote in message ...

>>I've wondered whether they're groping around for something they can
>>patent -- obvious and previously tried solutions don't meet that
>>criterion, leaving "the hard way" perhaps the preferred way from their
>>standpoint.

Yeesh, if that were the case at Intel, I wonder if they have teams of managers reviewing and shooting down ideas that are too radical, yet not proprietary enough? :-)

Yousuf Khan
#16
"Yousuf Khan" wrote in message ...

>SMT in the form of the P4's Hyperthreading was done without really
>adding too many transistors. However, it looks like any other
>architecture that wants to implement SMT will need to add to its
>transistor count.

IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not sure which) was only a few percent.

>I hear that the IBM Power5 will implement SMT, and that it has added
>25% to the transistor count.

I've seen that number as well and am curious what it actually refers to (given the EV8 experience, plus comments from the SMT researchers at UWash about the minimal added chip-area costs of SMT). It's possible that Px's use of instruction groups aggravates the problem, or that IBM is quoting the impact of side effects rather than just SMT per se (e.g., additional cache to accommodate the increased use by additional threads), or that IBM is referring only to the impact on the size of the processor core rather than on the overall chip area (which includes not only significant amounts of L2 cache but memory control and inter-chip routing logic plus, for P5, reportedly some kinds of on-chip offload engines for specific tasks).

- bill
#17
"Bill Todd" wrote in message ...

>IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not
>sure which) was only a few percent.

Perhaps Alpha is closest in philosophy to the P4, except from an earlier generation? That is, high frequencies but low IPC. After all, Alpha was the MHz king of processors for years before the crown was taken over by x86 processors. During Alpha's reign atop the MHz pile, its contemporaries (SPARC, MIPS, Power, PA-RISC, etc.) still seemed relatively competitive, despite not reaching the high MHz that Alpha did.

>>I hear that the IBM Power5 will implement SMT, and that it has added
>>25% to the transistor count.
>
>I've seen that number as well and am curious what it actually refers
>to [...] or that IBM is referring only to the impact on the size of
>the processor core rather than on the overall chip area [...]

I don't have that information, but I was also just working from the assumption that they were talking about a 25% increase in the size of just the inner core, not the overall die size.

Yousuf Khan
#18
"Yousuf Khan" wrote in message ...

>"Bill Todd" wrote in message ...
>>IIRC the reported impact on EV8 (for either 4- or 8-way SMT, I'm not
>>sure which) was only a few percent.
>
>Perhaps Alpha is closest in philosophy to the P4, except from an
>earlier generation? That is, high frequencies but low IPC.

Nope. While that characterization might have had some validity in early Alphas, by the time EV6 appeared Alpha's IPC was competitive with anyone's (and better than most) - and, unfortunately, soon thereafter Compaq lost interest in pushing Alpha clock rates up (after Capellas took over and reversed Pfeiffer's intention to market Alpha against the expected Itanic), so the *only* thing that kept Alpha ahead of the pack was its IPC (until, more recently, it fell a full process generation behind as well). EV8, by virtue of its 8-way issue and even greater number of in-flight instructions, would have had significantly better IPC than the rest of the world - leaving aside the impact of SMT on effective IPC.

[...]

>I don't have that information, but I was also just working from the
>assumption that they were talking about a 25% increase in the size of
>just the inner core, not the overall die size.

Depending on what the EV8 percentages referred to, that might be possible: my impression is that the POWERx core itself is pretty compact.

- bill
#20
On Sun, 28 Dec 2003 18:08:07 GMT, Robert Redelmeier wrote:

>In comp.sys.ibm.pc.hardware.chips Robert Myers wrote:
>>My only point was that latency still matters. Superficial examination
>>of early results from the HP Superdome showed that Itanium is
>>apparently not very tolerant of increased latency, and HP engineers
>
>Oh, fully agreed. For some apps, latency is _everything_ (linked-lists,
>TP DBs). If the app hopscotches randomly thru RAM (SETI?) nothing else
>matters much. Modern systems have done wonders to deliver bandwidth.
>Dual channel DDR at high clocks. But has much been done to improve
>latency from ~130 ns? (old number)

Actually yes, though it wasn't anything obvious -- more that reducing latency has been a major emphasis of recent memory controllers. Intel and nVidia were the first to get it right, and they both did a bang-up job with their i875 and nForce2 chipsets respectively. Latency has dropped to ~100 ns on both chipsets (though I've seen all sorts of different latency numbers depending on just how it is being measured).

>I thought the main idea behind on-CPU memory controllers was to reduce
>this to ~70 ns by reduced buffering/queuing.

On-chip memory controllers reduce latency in a few ways, and it works. Even against the greatly improved memory controllers from nVidia and Intel (and now SiS and VIA have more or less caught up), the Athlon64 and Opteron still have noticeably lower latency. In fact, even with registered memory the Opteron has lower latency than a P4 with unbuffered memory. Unfortunately there is only so much that can be done here. When you get right down to it, DRAM has high latency, and nothing you do on the memory-controller side of things can change that. The real solution to latency is to replace DRAM with something new.

>>Itanium currently retires instructions in order. Sooner or later,
>>Intel has to do something for Itanium other than to increase the
>>cache size.
>
>Are you suggesting Out-of-Order retirement??? Intriguing possibility
>with a new arch.
I think he's merely suggesting out-of-order execution. I don't know how well this would work with the IA-64 instruction set, but I suppose it should be possible.

-------------
Tony Hill
hilla underscore 20 at yahoo dot ca
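[The latency numbers quoted above are typically measured with a pointer chase: build a random cyclic permutation so every access depends on the previous one, which defeats any stride prefetcher, then time the chain. A rough sketch of that benchmark's shape -- in Python the interpreter overhead swamps DRAM latency, so this illustrates the method, not a way to reproduce the ~100 ns figures:]

```python
import random
import time

def build_chain(n):
    """Random cyclic permutation: next_[i] gives the index visited after i.
    A single cycle through all n slots, in shuffled order, so each load
    is data-dependent on the one before it."""
    order = list(range(n))
    random.shuffle(order)
    next_ = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        next_[a] = b
    return next_

def chase_latency(next_, iters):
    """Average seconds per dependent access along the chain."""
    p = 0
    t0 = time.perf_counter()
    for _ in range(iters):
        p = next_[p]            # each access waits on the previous one
    return (time.perf_counter() - t0) / iters
```

With a working set much larger than cache, every step of the chase is a cache miss, so on real hardware the per-access time converges on the raw memory latency -- which is why no chipset-side cleverness shows up in this number, only what the DRAM and controller path can actually deliver.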