#11
"David Wang" wrote in message ...

> The point here is that each/both threaded contexts will try to execute instructions past memory references that miss the cache. With only a single threaded context, the CPU can only go so deep before it runs out of instructions to execute, so it will stop issuing further memory references, even if those references are completely independent of the outstanding memory references. Having HT allows the CPU to execute down multiple threaded instruction streams, and there could well be more memory references before both threads stall out completely waiting for some old reference to return with data.

Right. This is just another case of a processor that is limited by the memory bandwidth. It's a tautology -- if the memory is the bottleneck, then the memory is the bottleneck, and the instructions will proceed as fast as the memory can get the data to and from them. HT or no HT.

> Having HT gives you multiple contexts within the same silicon so that there are more instructions to choose from, and consequently possibly more (concurrent) memory references as well. The HT/SMT aspect of the processor will be able to (more easily) saturate the FSB/memory system.

The total amount of memory accesses, concurrent or not, is limited by the memory bandwidth or the FSB bandwidth, whichever is less. The memory bandwidth is less. It's really that simple.

If one logical processor is doing a lot of memory accesses and gets stalled, and the other one doesn't need the memory bus, then the other logical processor can keep running. So again, one processor is held up due to memory speed and the other runs full tilt. How will a faster FSB help?

> The faster FSB is (presumed to be) tied to a memory system that supports higher memory bandwidth, so that the queuing delays would be smaller on the 800 mbps FSB (because it has a higher BW memory system), as compared to the 533 mbps FSB on the Xeon platform.

The delays are dominated by the memory timings.
It will be memory speed that will make the difference, not FSB speed.

> Both threaded contexts can and do issue multiple memory references into the memory system even while the previous memory reference is still outstanding on the memory system.

Yep, and if the memory is not the bottleneck for the CPU, then the FSB speed doesn't matter. All the data will be prefetched and lazily stored. But if the memory is the bottleneck, then a faster FSB again won't help, because it won't make the memory faster.

> We're not on the same page here. The P4-HT platform is assumed to have a memory system with higher DRAM bandwidth.

I never said DRAM bandwidth won't affect performance. I said FSB speed won't affect performance.

Here's a hypothetical for you. You take the machine with the 533MHz FSB and ratchet it up to 800MHz, leaving all else the same. How much do you think its memory access performance will improve?

I stand by my original statement. It is virtually impossible to max out a 533MHz FSB on a machine with only one physical CPU. With dual-channel DDR SDRAM, you might be able to do it for very short periods of time on realistic loads, but frankly I doubt even that (because a 533MHz FSB is always 533MHz, whereas memory bandwidth numbers are peak numbers that can only be sustained for times on the order of microseconds).

> The comparison here is not between one cpu and two cpu's on a 533 mbps FSB. The comparison here is between one P4 CPU with HT on 800 mbps FSB against two physical Xeon processors on the shared 533 mbps FSB with the proportionally lower bandwidth memory system.

Huh?! I'm specifically talking about the question the OP asked, which is how much of a difference a 533MHz FSB will make versus an 800MHz FSB. He was always talking one physical processor, two logical processors. Why you want to compare that to two physical Xeon processors is beyond me.
> If we could simply take your statements about "virtually impossible to max out a 533 (mbps) interface with a single CPU" and replace it with "slightly less impossible to max out a 533 (mbps) interface with a single cpu that has two threaded contexts (SMT/HT)", then that could be synced closely enough with my original statement about which platform would be "higher performance" for the original poster.

I sure as heck don't believe that a faster FSB will translate into performance improvements when the FSB isn't the bottleneck.

> Depends on your workload. What does your workload look like?

What workload do you think could max out a 533MHz FSB on an HTT P4? Or, to put it another way, concoct the most memory-intensive benchmark you can, then compute the real, effective memory bandwidth of a 533MHz FSB. See if you can get anywhere close to that on a real system with DDR SDRAM.

DS
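Schwartz's closing challenge can at least be framed numerically. A back-of-the-envelope Little's-law sketch shows how many cache-line misses must be in flight at once to saturate a 533 MT/s bus; the 150 ns miss latency and 64-byte line size are illustrative assumptions, not figures from the thread:

```python
# How many cache-line misses must be in flight to saturate the FSB?
# Little's law: bandwidth = outstanding * line_size / latency
LINE = 64             # bytes per cache line (assumed)
LATENCY = 150e-9      # assumed ~150 ns round-trip miss latency
FSB_PEAK = 533e6 * 8  # 64-bit bus at 533 MT/s ~= 4.26 GB/s peak

def outstanding_needed(bandwidth):
    """Concurrent misses required to sustain a given bandwidth."""
    return bandwidth * LATENCY / LINE

print(f"{outstanding_needed(FSB_PEAK):.1f} misses in flight")  # 10.0
```

A single context that stalls after a handful of outstanding misses cannot get near ten in flight, which is the arithmetic behind both "virtually impossible with one CPU" and "slightly less impossible with two threaded contexts".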
#12
Hmm, he did say *dual* XEON. All my comments were about UP systems.

On a system with more than one physical CPU, FSB speed is critical: not because of memory bandwidth, but because of inter-CPU bandwidth (for cache coherency traffic).

DS
#13
David Schwartz wrote:
> "David Wang" wrote in message ...
>
>> Having HT gives you multiple contexts within the same silicon so that there are more instructions to choose from, and consequently possibly more (concurrent) memory references as well. The HT/SMT aspect of the processor will be able to (more easily) saturate the FSB/memory system.
>
> The total amount of memory accesses, concurrent or not, is limited by the memory bandwidth or the FSB bandwidth, whichever is less. The memory bandwidth is less. It's really that simple.

.... which could be saturated to a greater extent by the HT aspect of the P4 processor ...

> If one logical processor is doing a lot of memory accesses and gets stalled and the other one doesn't need the memory bus, then the other logical processor can keep running. So again, one processor is held up due to memory speed and the other runs full tilt. How will a faster FSB help?
>
>> The faster FSB is (presumed to be) tied to a memory system that supports higher memory bandwidth, so that the queuing delays would be smaller on the 800 mbps FSB (because it has a higher BW memory system), as compared to the 533 mbps FSB on the Xeon platform.
>
> The delays are dominated by the memory timings. It will be memory speed that will make the difference, not FSB speed.

We're still not communicating properly. The comparison is between a P4-HT system and a dual Xeon system. Presumably, the P4-HT system is attached to the dual-channel (PC3200) DDR SDRAM system, and the Xeon platform has a proportionally lower bandwidth memory system. The "memory speed" is coupled to the "FSB speed". Although this point was not explicitly declared in earlier portions of this thread, I had believed (incorrectly, apparently) that it was implicitly understood. The P4 HT platform not only has the higher bandwidth FSB, it has a proportionally higher bandwidth DRAM system as well.
If you want to go back and try to run the 800 mbps FSB with DDR SDRAM memory running at 266 mbps just to "equalize" the memory system between the P4 HT platform and the dual Xeon platform, I suppose you can, but you would simply be slowing down parts of the P4 HT platform just to skew the comparison.

> Yep, and if the memory is not the bottleneck for the CPU, then the FSB speed doesn't matter. All the data will be prefetched and lazily stored. But if the memory is the bottleneck, then a faster FSB again won't help because it won't make the memory faster.
>
>> We're not on the same page here. The P4-HT platform is assumed to have a memory system with higher DRAM bandwidth.
>
> I never said DRAM bandwidth won't affect performance. I said FSB speed won't affect performance.

You may be debating the theoretics of FSB performance as separate from memory system performance, that is, assuming an "equivalent DRAM memory system", but that is not the case here. The P4 HT system has the more capable (higher bandwidth) DRAM system with the "dual channel" PC3200 DDR SDRAM.

> Here's a hypothetical for you. You take the machine with the 533MHz FSB and ratchet it up to 800MHz, leaving all else the same. How much do you think its memory access performance will improve?

This is difficult to answer; it depends on whether you keep the "cycle latency" the same or the "wall clock tick latency" the same.

>> The comparison here is not between one cpu and two cpu's on a 533 mbps FSB. The comparison here is between one P4 CPU with HT on 800 mbps FSB against two physical Xeon processors on the shared 533 mbps FSB with the proportionally lower bandwidth memory system.
>
> Huh?! I'm specifically talking about the question the OP asked, which is how much of a difference a 533MHz FSB will make versus an 800MHz FSB. He was always talking one physical processor, two logical processors. Why you want to compare that to two physical Xeon processors is beyond me.

Please read the title of this thread.
You have misunderstood the premise of the comparison from the very beginning. His question was comparing the performance of his application on a single P4 HT processor based platform against its performance on a dual Xeon platform. Your single processor comparison may be valid, but not in the context of this thread.

>> If we could simply take your statements about "virtually impossible to max out a 533 (mbps) interface with a single CPU" and replace it with "slightly less impossible to max out a 533 (mbps) interface with a single cpu that has two threaded contexts (SMT/HT)", then that could be synced closely enough with my original statement about which platform would be "higher performance" for the original poster.
>
> I sure as heck don't believe that a faster FSB will translate into performance improvements when the FSB isn't the bottleneck.

See below.

>> Depends on your workload. What does your workload look like?
>
> What workload do you think could max out a 533MHz FSB on an HTT P4? Or, put it another way, concoct the most memory-intensive benchmark you can, then compute the real, effective memory bandwidth of a 533MHz FSB. See if you can get anywhere close to that on a real system with DDR SDRAM.

You are back to the "FSB" thing in isolation again. The P4 HT (800 mbps) FSB has a more bandwidth-capable memory system behind it as compared to the (533 mbps) Xeon platform. When the application is bandwidth saturated, the P4 HT platform will simply perform better.

-- davewang202(at)yahoo(dot)com
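Wang's point that "memory speed is coupled to FSB speed" can be put in peak numbers. These are peak figures only, and the exact memory configurations are assumptions about the two platforms under discussion:

```python
# Peak bandwidths behind the two platforms being compared (peak only)
def fsb_peak(mt_per_sec, width_bytes=8):
    """Peak transfer rate of a 64-bit front-side bus."""
    return mt_per_sec * width_bytes

p4_fsb      = fsb_peak(800e6)   # 6.4 GB/s
xeon_fsb    = fsb_peak(533e6)   # ~4.26 GB/s, shared by both Xeons
dual_pc3200 = 2 * 3.2e9         # dual-channel PC3200 DRAM: 6.4 GB/s

# The P4 platform's FSB and DRAM peaks are matched at 6.4 GB/s;
# the Xeon pair shares the slower bus.
assert p4_fsb == dual_pc3200
print(f"{p4_fsb/1e9:.2f} vs {xeon_fsb/1e9:.2f} GB/s")  # 6.40 vs 4.26
```

That matched 6.4 GB/s pairing is why the two quantities cannot be varied independently on these platforms, which is the crux of the disagreement.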
#14
> I am taking a gamble and have ordered a Pentium 4 HT, 2.8GHz with a FSB of

I've got a Dual Xeon 2.8GHz Dell 2600 in the rack here if you want to see some numbers... hehe
#15
"David Wang" wrote in message ...

> David Schwartz wrote:
>
> Yes he did. The systems to be compared are a single-processor P4 HT system against a dual Xeon system. The P4 HT system not only has the higher data rate FSB, it also has higher (peak and sustainable) memory bandwidth as well. On something that is memory intensive, the P4-HT platform will be "higher performance". Hence my statement from the very beginning about the relative performance of the two platforms.
>
> Depends on your workload. What does your workload look like?

In fact, with some workloads, SMP with a 533MHz FSB can actually totally suck. Inter-CPU traffic can saturate the FSB for some workloads.

>> On a system with more than one physical CPU, FSB speed is critical. Not because of memory bandwidth but because of inter-CPU bandwidth (for cache coherency traffic).
>
> I cannot buy this argument.

Oh goody, something else to argue about.

> For things like SPEC rate or someone running multiple instances of SETI@HOME or some multithreaded photoshop filter, there is no cache coherency traffic, since all of the snoop broadcast requests will return with the status of snoop miss, and the issue is simply bandwidth sharing on the FSB.

You forget about system calls, spinlocks, handles that reference global tables, and all other sorts of nastiness.

> For things like TPC or multithreaded quake or some other program that uses shared memory segments, there will be cache coherency traffic, but even when there is a HIT or HITM returned as the result of the snoop request, the CPU-to-CPU cacheline transfer would still occupy the same amount of time on the data bus as an ordinary read request that had a snoop miss. So in this case, the bandwidth utilization (on the data bus) is still the same. The total data bus utilization would look the same whether the inter-CPU traffic is 0% or 100% of all memory read requests.

There is no UP phenomenon analogous to cache ping-ponging on an SMP machine. On a UP machine, these would be L2 cache hits.
Consider two threads of the same process, one running on each processor. They're doing a lot of calls to 'malloc' and 'free', and the memory allocator isn't well optimized for multithreaded code. So the tracking information for the 'malloc' structures is constantly being evicted from one CPU's cache to populate the other's, over the FSB, bypassing memory. There is nothing analogous in a UP system (except perhaps DMA).

DS
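The cost of that kind of ping-ponging can be put in rough numbers. Both latencies below are illustrative assumptions (not measurements from either platform), but they show why even a small fraction of cache-to-cache transfers dominates:

```python
# Rough cost of a cache-line ping-pong vs. a local L2 hit
L2_HIT    = 10e-9    # assumed ~10 ns L2 hit
PING_PONG = 300e-9   # assumed FSB cache-to-cache transfer cost

def slowdown(fraction_pingpong):
    """Average access cost when some fraction of L2 hits become transfers."""
    avg = (1 - fraction_pingpong) * L2_HIT + fraction_pingpong * PING_PONG
    return avg / L2_HIT

print(f"{slowdown(0.10):.1f}x")  # even 10% ping-pongs is ~3.9x slower
```

This is why allocators with per-thread arenas exist: keeping each thread's 'malloc' metadata in its own cache lines drives the ping-pong fraction toward zero.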
#17
David Schwartz wrote:
> "Alex Johnson" wrote in message ...
>
>> In this case the 2 P4s still have the same bandwidth to memory while the 2 Opterons have twice the bandwidth to memory.
>
> I would think the bandwidth to memory is limited by the speed of the memory itself. If the memory is incredibly fast, then the bandwidth to memory is limited by the FSB shared by both CPUs.

Opterons do not have a shared FSB. They share an HT link for cache coherency, but that's apparently not the bottleneck for the highly parallelizable workload. Each Opteron has a private pool of "dual channel" DDR SDRAM (in reality a single-channel, 128-bit-wide DDR SDRAM memory system), whereas the dual P4/Xeon has to share the bandwidth to the memory system, even when the workload is highly parallelizable.

Alex's speculation is likely correct as to the cause of the apparent non-speedup of the workload on the dual P4/Xeon platform.

-- davewang202(at)yahoo(dot)com
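Alex's "twice the bandwidth" claim is straightforward arithmetic once the topologies are distinguished. The per-socket DDR333 figure below is an assumption about the Opteron platform under discussion, and these are peak numbers only:

```python
# Aggregate peak memory bandwidth: a shared FSB vs. per-socket controllers
# (the 128-bit DDR333 per-socket figure is an assumption for illustration)
def shared_fsb_total(sockets, fsb_bw):
    """All CPUs on one front-side bus split a single bandwidth pool."""
    return fsb_bw

def per_socket_total(sockets, local_bw):
    """Each Opteron brings its own memory channels to the system."""
    return sockets * local_bw

xeon_pair    = shared_fsb_total(2, 533e6 * 8)   # ~4.26 GB/s for both CPUs
opteron_pair = per_socket_total(2, 16 * 333e6)  # ~10.7 GB/s aggregate
print(f"{xeon_pair/1e9:.2f} vs {opteron_pair/1e9:.2f} GB/s")
```

The design choice is the point: per-socket controllers make aggregate bandwidth scale with socket count, while a shared FSB makes it a fixed pool no matter how parallelizable the workload is.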
#18
David Schwartz wrote:
> "David Wang" wrote in message ...
>
>> Opterons do not have a shared FSB. They share an HT link for cache coherency, but that's apparently not the bottleneck for the highly parallelizable workload. Each Opteron has a private pool of "dual channel" DDR SDRAM (in reality a single-channel, 128-bit-wide DDR SDRAM memory system), whereas the dual P4/Xeon has to share the bandwidth to the memory system, even when the workload is highly parallelizable. Alex's speculation is likely correct as to the cause of the apparent non-speedup of the workload on the dual P4/Xeon platform.
>
> So MP Opteron machines are NUMA? Surely one processor can't access the other processor's memory as rapidly as it can access its own.

Yes, it is effectively NUMA. Currently, with non-NUMA-aware OS's, data placement is not NUMA aware, so the two separate pools of memory are treated as "uniform", and the associated penalties are exacted for the ignorance in the data placement aspect of the memory allocation. IIRC, a local memory reference takes about 50ns, and a remote access over Hypertransport takes closer to 100ns, so even when averaged out, the average memory access latency is still fairly decent compared to the P4/Xeon's separate Northbridge memory controller setup.

> Do you know any place to find information about how memory access works in an MP Opteron machine? What happens if one CPU keeps accessing memory physically attached to the other? Can either CPU cache memory attached to either CPU?

http://www.hotchips.org/archive/hc14...r_MP_HC_v8.pdf

Also, documentation may be found on AMD's web site. Essentially, Opteron is just a K7 CPU core with a built-in router connected to it on the same piece of silicon. It also has a memory controller sitting on the other side of the router.
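The 50 ns / 100 ns figures above give a quick way to estimate the cost of that NUMA-unaware placement (a sketch using the post's own IIRC numbers; real latencies vary by platform and load):

```python
# Average latency when pages are spread uniformly over all nodes
LOCAL  = 50e-9      # ~50 ns local access (figure quoted above)
REMOTE = 100e-9     # ~100 ns over Hypertransport (figure quoted above)

def avg_latency(nodes):
    """NUMA-unaware placement: only 1/nodes of accesses land locally."""
    local_frac = 1 / nodes
    return local_frac * LOCAL + (1 - local_frac) * REMOTE

print(f"{avg_latency(2) * 1e9:.0f} ns")   # 75 ns on a 2-way Opteron
```

Which is the "still fairly decent" point: even the blended 75 ns average compares well against a trip through a separate Northbridge.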
When the K7 CPU misses L1 and L2, it sends the request to the router (Xbar); if the request is local, it goes into a queue in the local memory controller, else the request is sent off-chip through the Hypertransport links. Each CPU can freely access any other CPU's memory.

> It doesn't look like bus snooping is possible, and I don't see any cache coherency interface on the CPU that could be connected to another CPU. Is one of the HT links dedicated to cache coherency and other inter-processor traffic?

The Hypertransport links are generic. They carry both the snoop coherency traffic as well as the data traffic. I believe AMD uses a slightly specialized cache-coherent implementation of Hypertransport, aka ccHT.

> I've searched AMD's web site but I can't find anything explaining how it works.

-- davewang202(at)yahoo(dot)com
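The miss-handling path described above can be sketched as a toy routing function. The node numbering and address map here are invented for illustration; the real Xbar decodes hardware address ranges programmed by firmware:

```python
# Toy model of the Opteron Xbar decision for an L2 miss
def route_request(addr, my_node, home_node_of):
    """Send a miss to the local controller or out a Hypertransport link."""
    home = home_node_of(addr)      # which node owns this address range
    if home == my_node:
        return "local memory controller"
    return f"HT link toward node {home}"

# Hypothetical 2-node system, each node owning 1 GB of a 2 GB space:
home_node_of = lambda a: 0 if a < (1 << 30) else 1
print(route_request(0x1000, 0, home_node_of))        # local
print(route_request(0x50000000, 0, home_node_of))    # remote, via HT
```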
#19
"David Wang" wrote in message ...

> Also, documentation may be found on AMD's web site. Essentially, Opteron is just a K7 CPU core with a built-in router connected to it on the same piece of silicon. It also has a memory controller sitting on the other side of the router. When the K7 CPU misses L1 and L2, it sends the request to the router (Xbar); if the request is local, it goes into a queue in the local memory controller, else the request is sent off-chip through the Hypertransport links.

Thanks for the additional information.

> http://www.hotchips.org/archive/hc14...r_MP_HC_v8.pdf

That link answered all my questions. Sounds like a very impressive design.

DS