#11
"David Wang" wrote in message ...

> The point here is that each/both threaded contexts will try to execute instructions past memory references that miss the cache. With only a single threaded context, the CPU can only go so deep before it runs out of instructions to execute, so it will stop issuing further memory references, even if those references are completely independent of the outstanding memory references. Having HT allows the CPU to execute down multiple threaded instruction streams, and there could well be more memory references before both threads stall out completely waiting for some old reference to return with data.

Right. This is just another case of a processor that is limited by the memory bandwidth. It's a tautology -- if the memory is the bottleneck, then the memory is the bottleneck, and the instructions will proceed as fast as the memory can get the data to and from them. HT or no HT.

> Having HT gives you multiple contexts within the same silicon so that there are more instructions to choose from, and consequently possibly more (concurrent) memory references as well. The HT/SMT aspect of the processor will be able to (more easily) saturate the FSB/memory system.

The total amount of memory accesses, concurrent or not, is limited by the memory bandwidth or the FSB bandwidth, whichever is less. The memory bandwidth is less. It's really that simple.

If one logical processor is doing a lot of memory accesses and gets stalled, and the other one doesn't need the memory bus, then the other logical processor can keep running. So again, one processor is held up due to memory speed and the other runs full tilt. How will a faster FSB help?

> The faster FSB is (presumed to be) tied to a memory system that supports higher memory bandwidth, so that the queuing delays would be smaller on the 800 mbps FSB (because it has a higher BW memory system), as compared to the 533 mbps FSB on the Xeon platform.

The delays are dominated by the memory timings.
It will be memory speed that will make the difference, not FSB speed.

> Both threaded contexts can and do issue multiple memory references into the memory system even while the previous memory reference is still outstanding on the memory system.

Yep, and if the memory is not the bottleneck for the CPU, then the FSB speed doesn't matter. All the data will be prefetched and lazily stored. But if the memory is the bottleneck, then a faster FSB again won't help, because it won't make the memory faster.

> We're not on the same page here. The P4-HT platform is assumed to have a memory system with higher DRAM bandwidth.

I never said DRAM bandwidth won't affect performance. I said FSB speed won't affect performance.

Here's a hypothetical for you. You take the machine with the 533MHz FSB and ratchet it up to 800MHz, leaving all else the same. How much do you think its memory access performance will improve?

I stand by my original statement. It is virtually impossible to max out a 533MHz FSB on a machine with only one physical CPU. With dual-channel DDR SDRAM, you might be able to do it for very short periods of time on realistic loads, but frankly I doubt even that (because a 533MHz FSB is always 533MHz, whereas memory bandwidth numbers are peak numbers that can only be sustained for times on the order of microseconds).

> The comparison here is not between one cpu and two cpu's on a 533 mbps FSB. The comparison here is between one P4 CPU with HT on 800 mbps FSB against two physical Xeon processors on the shared 533 mbps FSB with the proportionally lower bandwidth memory system.

Huh?! I'm specifically talking about the question the OP asked, which is how much of a difference a 533MHz FSB will make versus an 800MHz FSB. He was always talking one physical processor, two logical processors. Why you want to compare that to two physical Xeon processors is beyond me.
> If we could simply take your statements about "virtually impossible to max out a 533 (mbps) interface with a single CPU" and replace it with "slightly less impossible to max out a 533 (mbps) interface with a single cpu that has two threaded contexts (SMT/HT)", then that could be synced closely enough with my original statement about which platform would be "higher performance" for the original poster.

I sure as heck don't believe that a faster FSB will translate into performance improvements when the FSB isn't the bottleneck.

> Depends on your workload. What does your workload look like?

What workload do you think could max out a 533MHz FSB on an HTT P4? Or, to put it another way, concoct the most memory-intensive benchmark you can, then compute the real, effective memory bandwidth of a 533MHz FSB. See if you can get anywhere close to that on a real system with DDR SDRAM.

DS
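Schwartz's closing challenge can at least be framed numerically. A back-of-the-envelope Little's-law sketch shows how many cache-line misses must be in flight at once to saturate a 533 MT/s bus; the 150 ns miss latency and 64-byte line size are illustrative assumptions, not figures from the thread:

```python
# How many cache-line misses must be in flight to saturate the FSB?
# Little's law: bandwidth = outstanding * line_size / latency
LINE = 64             # bytes per cache line (assumed)
LATENCY = 150e-9      # assumed ~150 ns round-trip miss latency
FSB_PEAK = 533e6 * 8  # 64-bit bus at 533 MT/s ~= 4.26 GB/s peak

def outstanding_needed(bandwidth):
    """Concurrent misses required to sustain a given bandwidth."""
    return bandwidth * LATENCY / LINE

print(f"{outstanding_needed(FSB_PEAK):.1f} misses in flight")  # 10.0
```

A single context that stalls after a handful of outstanding misses cannot get near ten in flight, which is the arithmetic behind both "virtually impossible with one CPU" and "slightly less impossible with two threaded contexts".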
#12
Hmm, he did say *dual* XEON. All my comments were about UP systems.

On a system with more than one physical CPU, FSB speed is critical: not because of memory bandwidth, but because of inter-CPU bandwidth (for cache coherency traffic).

DS
#13
David Schwartz wrote:
> "David Wang" wrote in message ...
>
>> Having HT gives you multiple contexts within the same silicon so that there are more instructions to choose from, and consequently possibly more (concurrent) memory references as well. The HT/SMT aspect of the processor will be able to (more easily) saturate the FSB/memory system.
>
> The total amount of memory accesses, concurrent or not, is limited by the memory bandwidth or the FSB bandwidth, whichever is less. The memory bandwidth is less. It's really that simple.

.... which could be saturated to a greater extent by the HT aspect of the P4 processor ...

> If one logical processor is doing a lot of memory accesses and gets stalled and the other one doesn't need the memory bus, then the other logical processor can keep running. So again, one processor is held up due to memory speed and the other runs full tilt. How will a faster FSB help?
>
>> The faster FSB is (presumed to be) tied to a memory system that supports higher memory bandwidth, so that the queuing delays would be smaller on the 800 mbps FSB (because it has a higher BW memory system), as compared to the 533 mbps FSB on the Xeon platform.
>
> The delays are dominated by the memory timings. It will be memory speed that will make the difference, not FSB speed.

We're still not communicating properly. The comparison is between a P4-HT system and a dual Xeon system. Presumably, the P4-HT system is attached to the dual-channel (PC3200) DDR SDRAM system, and the Xeon platform has a proportionally lower bandwidth memory system. The "memory speed" is coupled to the "FSB speed". Although this point was not explicitly declared in earlier portions of this thread, I had believed (incorrectly, apparently) that it was implicitly understood. The P4 HT platform not only has the higher bandwidth FSB, it has a proportionally higher bandwidth DRAM system as well.
If you want to go back and try to run the 800 mbps FSB with DDR SDRAM memory running at 266 mbps just to "equalize" the memory system between the P4 HT platform and the dual Xeon platform, I suppose you can, but you would simply be slowing down parts of the P4 HT platform just to skew the comparison.

> Yep, and if the memory is not the bottleneck for the CPU, then the FSB speed doesn't matter. All the data will be prefetched and lazily stored. But if the memory is the bottleneck, then a faster FSB again won't help because it won't make the memory faster.
>
>> We're not on the same page here. The P4-HT platform is assumed to have a memory system with higher DRAM bandwidth.
>
> I never said DRAM bandwidth won't affect performance. I said FSB speed won't affect performance.

You may be debating the theoretics of FSB performance as separate from memory system performance, that is, assuming an "equivalent DRAM memory system", but that is not the case here. The P4 HT system has the more capable (higher bandwidth) DRAM system with the "dual channel" PC3200 DDR SDRAM.

> Here's a hypothetical for you. You take the machine with the 533MHz FSB and ratchet it up to 800MHz, leaving all else the same. How much do you think its memory access performance will improve?

This is difficult to answer; it depends on whether you keep the "cycle latency" the same or the "wall clock tick latency" the same.

>> The comparison here is not between one cpu and two cpu's on a 533 mbps FSB. The comparison here is between one P4 CPU with HT on 800 mbps FSB against two physical Xeon processors on the shared 533 mbps FSB with the proportionally lower bandwidth memory system.
>
> Huh?! I'm specifically talking about the question the OP asked, which is how much of a difference a 533MHz FSB will make versus an 800MHz FSB. He was always talking one physical processor, two logical processors. Why you want to compare that to two physical Xeon processors is beyond me.

Please read the title of this thread.
You have misunderstood the premise of the comparison from the very beginning. His question was comparing the performance of his application on a single P4 HT processor based platform against its performance on a dual Xeon platform. Your single processor comparison may be valid, but not in the context of this thread.

>> If we could simply take your statements about "virtually impossible to max out a 533 (mbps) interface with a single CPU" and replace it with "slightly less impossible to max out a 533 (mbps) interface with a single cpu that has two threaded contexts (SMT/HT)", then that could be synced closely enough with my original statement about which platform would be "higher performance" for the original poster.
>
> I sure as heck don't believe that a faster FSB will translate into performance improvements when the FSB isn't the bottleneck.

See below.

>> Depends on your workload. What does your workload look like?
>
> What workload do you think could max out a 533MHz FSB on an HTT P4? Or, put it another way, concoct the most memory-intensive benchmark you can, then compute the real, effective memory bandwidth of a 533MHz FSB. See if you can get anywhere close to that on a real system with DDR SDRAM.

You are back to the "FSB" thing in isolation again. The P4 HT (800 mbps) FSB has a more bandwidth-capable memory system behind it as compared to the (533 mbps) Xeon platform. When the application is bandwidth saturated, the P4 HT platform will simply perform better.

-- davewang202(at)yahoo(dot)com
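Wang's point that "memory speed is coupled to FSB speed" can be put in peak numbers. These are peak figures only, and the exact memory configurations are assumptions about the two platforms under discussion:

```python
# Peak bandwidths behind the two platforms being compared (peak only)
def fsb_peak(mt_per_sec, width_bytes=8):
    """Peak transfer rate of a 64-bit front-side bus."""
    return mt_per_sec * width_bytes

p4_fsb      = fsb_peak(800e6)   # 6.4 GB/s
xeon_fsb    = fsb_peak(533e6)   # ~4.26 GB/s, shared by both Xeons
dual_pc3200 = 2 * 3.2e9         # dual-channel PC3200 DRAM: 6.4 GB/s

# The P4 platform's FSB and DRAM peaks are matched at 6.4 GB/s;
# the Xeon pair shares the slower bus.
assert p4_fsb == dual_pc3200
print(f"{p4_fsb/1e9:.2f} vs {xeon_fsb/1e9:.2f} GB/s")  # 6.40 vs 4.26
```

That matched 6.4 GB/s pairing is why the two quantities cannot be varied independently on these platforms, which is the crux of the disagreement.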
#14
> I am taking a gamble and have ordered a Pentium 4 HT, 2.8GHz with a FSB of

I've got a Dual Xeon 2.8GHz Dell 2600 in the rack here if you want to see some numbers... hehe
#15
"David Wang" wrote in message ...

> David Schwartz wrote:
>
> Yes he did. The systems to be compared are a single-processor P4 HT system against a dual Xeon system. The P4 HT system not only has the higher data rate FSB, it also has higher (peak and sustainable) memory bandwidth as well. On something that is memory intensive, the P4-HT platform will be "higher performance". Hence my statement from the very beginning about the relative performance of the two platforms.
>
> Depends on your workload. What does your workload look like?

In fact, with some workloads, SMP with a 533MHz FSB can actually totally suck. Inter-CPU traffic can saturate the FSB for some workloads.

>> On a system with more than one physical CPU, FSB speed is critical. Not because of memory bandwidth but because of inter-CPU bandwidth (for cache coherency traffic).
>
> I cannot buy this argument.

Oh goody, something else to argue about.

> For things like SPEC rate or someone running multiple instances of SETI@HOME or some multithreaded photoshop filter, there is no cache coherency traffic, since all of the snoop broadcast requests will return with the status of snoop miss, and the issue is simply bandwidth sharing on the FSB.

You forget about system calls, spinlocks, handles that reference global tables, and all other sorts of nastiness.

> For things like TPC or multithreaded quake or some other program that uses shared memory segments, there will be cache coherency traffic, but even when there is a HIT or HITM returned as the result of the snoop request, the CPU-to-CPU cacheline transfer would still occupy the same amount of time on the data bus as an ordinary read request that had a snoop miss. So in this case, the bandwidth utilization (on the data bus) is still the same. The total data bus utilization would look the same whether the inter-CPU traffic is 0% or 100% of all memory read requests.

There is no UP phenomenon analogous to cache ping-ponging on an SMP machine. On a UP machine, these would be L2 cache hits.
Consider two threads of the same process, one running on each processor. They're doing a lot of calls to 'malloc' and 'free', and the memory allocator isn't well optimized for multithreaded code. So the tracking information for the 'malloc' structures is constantly being evicted from one CPU's cache to populate the other's, over the FSB, bypassing memory. There is nothing analogous in a UP system (except perhaps DMA).

DS
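The cost of that kind of ping-ponging can be put in rough numbers. Both latencies below are illustrative assumptions (not measurements from either platform), but they show why even a small fraction of cache-to-cache transfers dominates:

```python
# Rough cost of a cache-line ping-pong vs. a local L2 hit
L2_HIT    = 10e-9    # assumed ~10 ns L2 hit
PING_PONG = 300e-9   # assumed FSB cache-to-cache transfer cost

def slowdown(fraction_pingpong):
    """Average access cost when some fraction of L2 hits become transfers."""
    avg = (1 - fraction_pingpong) * L2_HIT + fraction_pingpong * PING_PONG
    return avg / L2_HIT

print(f"{slowdown(0.10):.1f}x")  # even 10% ping-pongs is ~3.9x slower
```

This is why allocators with per-thread arenas exist: keeping each thread's 'malloc' metadata in its own cache lines drives the ping-pong fraction toward zero.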
#17
David Schwartz wrote:
> "Alex Johnson" wrote in message ...
>
>> In this case the 2 P4s still have the same bandwidth to memory while the 2 Opterons have twice the bandwidth to memory.
>
> I would think the bandwidth to memory is limited by the speed of the memory itself. If the memory is incredibly fast, then the bandwidth to memory is limited by the FSB shared by both CPUs.

Opterons do not have a shared FSB. They share an HT link for cache coherency, but that's apparently not the bottleneck for the highly parallelizable workload. Each Opteron has a private pool of "dual channel" DDR SDRAM (in reality a single-channel, 128-bit-wide DDR SDRAM memory system), whereas the dual P4/Xeon has to share the bandwidth to the memory system, even when the workload is highly parallelizable.

Alex's speculation is likely correct as to the cause of the apparent non-speedup of the workload on the dual P4/Xeon platform.

-- davewang202(at)yahoo(dot)com
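Alex's "twice the bandwidth" claim is straightforward arithmetic once the topologies are distinguished. The per-socket DDR333 figure below is an assumption about the Opteron platform under discussion, and these are peak numbers only:

```python
# Aggregate peak memory bandwidth: a shared FSB vs. per-socket controllers
# (the 128-bit DDR333 per-socket figure is an assumption for illustration)
def shared_fsb_total(sockets, fsb_bw):
    """All CPUs on one front-side bus split a single bandwidth pool."""
    return fsb_bw

def per_socket_total(sockets, local_bw):
    """Each Opteron brings its own memory channels to the system."""
    return sockets * local_bw

xeon_pair    = shared_fsb_total(2, 533e6 * 8)   # ~4.26 GB/s for both CPUs
opteron_pair = per_socket_total(2, 16 * 333e6)  # ~10.7 GB/s aggregate
print(f"{xeon_pair/1e9:.2f} vs {opteron_pair/1e9:.2f} GB/s")
```

The design choice is the point: per-socket controllers make aggregate bandwidth scale with socket count, while a shared FSB makes it a fixed pool no matter how parallelizable the workload is.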
#18
David Schwartz wrote:
> "David Wang" wrote in message ...
>
>> Opterons do not have a shared FSB. They share an HT link for cache coherency, but that's apparently not the bottleneck for the highly parallelizable workload. Each Opteron has a private pool of "dual channel" DDR SDRAM (in reality a single-channel, 128-bit-wide DDR SDRAM memory system), whereas the dual P4/Xeon has to share the bandwidth to the memory system, even when the workload is highly parallelizable. Alex's speculation is likely correct as to the cause of the apparent non-speedup of the workload on the dual P4/Xeon platform.
>
> So MP Opteron machines are NUMA? Surely one processor can't access the other processor's memory as rapidly as it can access its own.

Yes, it is effectively NUMA. Currently, with non-NUMA-aware OS's, data placement is not NUMA aware, so the two separate pools of memory are treated as "uniform", and the associated penalties are exacted for the ignorance in the data placement aspect of the memory allocation. IIRC, a local memory reference takes about 50ns, and a remote access over Hypertransport takes closer to 100ns, so even when averaged out, the average memory access latency is still fairly decent compared to the P4/Xeon's separate Northbridge memory controller setup.

> Do you know any place to find information about how memory access works in an MP Opteron machine? What happens if one CPU keeps accessing memory physically attached to the other? Can either CPU cache memory attached to either CPU?

http://www.hotchips.org/archive/hc14...r_MP_HC_v8.pdf

Also, documentation may be found on AMD's web site. Essentially, Opteron is just a K7 CPU core with a built-in router connected to it on the same piece of silicon. It also has a memory controller sitting on the other side of the router.
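The 50 ns / 100 ns figures above give a quick way to estimate the cost of that NUMA-unaware placement (a sketch using the post's own IIRC numbers; real latencies vary by platform and load):

```python
# Average latency when pages are spread uniformly over all nodes
LOCAL  = 50e-9      # ~50 ns local access (figure quoted above)
REMOTE = 100e-9     # ~100 ns over Hypertransport (figure quoted above)

def avg_latency(nodes):
    """NUMA-unaware placement: only 1/nodes of accesses land locally."""
    local_frac = 1 / nodes
    return local_frac * LOCAL + (1 - local_frac) * REMOTE

print(f"{avg_latency(2) * 1e9:.0f} ns")   # 75 ns on a 2-way Opteron
```

Which is the "still fairly decent" point: even the blended 75 ns average compares well against a trip through a separate Northbridge.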
When the K7 CPU misses L1 and L2, it sends the request to the router (Xbar); if the request is local, it goes into a queue in the local memory controller, else the request is sent off-chip through the Hypertransport links. Each CPU can freely access any other CPU's memory.

> It doesn't look like bus snooping is possible, and I don't see any cache coherency interface on the CPU that could be connected to another CPU. Is one of the HT links dedicated to cache coherency and other inter-processor traffic?

The Hypertransport links are generic. They carry both the snoop coherency traffic as well as the data traffic. I believe AMD uses a slightly specialized cache-coherent implementation of Hypertransport, aka ccHT.

> I've searched AMD's web site but I can't find anything explaining how it works.

-- davewang202(at)yahoo(dot)com
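The miss-handling path described above can be sketched as a toy routing function. The node numbering and address map here are invented for illustration; the real Xbar decodes hardware address ranges programmed by firmware:

```python
# Toy model of the Opteron Xbar decision for an L2 miss
def route_request(addr, my_node, home_node_of):
    """Send a miss to the local controller or out a Hypertransport link."""
    home = home_node_of(addr)      # which node owns this address range
    if home == my_node:
        return "local memory controller"
    return f"HT link toward node {home}"

# Hypothetical 2-node system, each node owning 1 GB of a 2 GB space:
home_node_of = lambda a: 0 if a < (1 << 30) else 1
print(route_request(0x1000, 0, home_node_of))        # local
print(route_request(0x50000000, 0, home_node_of))    # remote, via HT
```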
#19
"David Wang" wrote in message ...

> Also, documentation may be found on AMD's web site. Essentially, Opteron is just a K7 CPU core with a built-in router connected to it on the same piece of silicon. It also has a memory controller sitting on the other side of the router. When the K7 CPU misses L1 and L2, it sends the request to the router (Xbar); if the request is local, it goes into a queue in the local memory controller, else the request is sent off-chip through the Hypertransport links.

Thanks for the additional information.

> http://www.hotchips.org/archive/hc14...r_MP_HC_v8.pdf

That link answered all my questions. Sounds like a very impressive design.

DS