HardwareBanter - View Single Post - fastest floating point operation as possible

#4 January 30th 04, 12:40 AM

On Thu, 29 Jan 2004 09:39:52 -0800, Paul Spitalny
wrote:

Hi,
I have a machine with a Pentium 4 2.52 GHz processor with 1 GB of Rambus
memory. I think the bus speed is 500 MHz (or thereabouts)? The machine is
about 1.5 years old.

The question I have is this:

Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math). I am wondering whether, by going to
the newest Intel chipset (Pentium 4 Extreme with 3.2 GHz clock and 800 MHz
bus), I'll get a significant increase in speed beyond the sheer
clock speed increase. That is, will the speed improvement only be
3.2 GHz / 2.5 GHz = 1.28 (a 28% speed increase), or are the architecture and
bus speed going to give me much more performance than I currently have?

CPU performance doesn't scale linearly with clock rate, so no: everything
else being the same, you'll get less than a 28% increase.
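
To put a rough number on that, here is a minimal sketch in C. It assumes a simple Amdahl-style model that is not from the original question or benchmarks: a fraction `clock_bound` of the run time scales with the core clock, and the rest (memory waits and so on) does not. The function name and the model parameter are illustrative assumptions only.

```c
#include <assert.h>

/* Estimated overall speedup when only part of the run time scales with
 * the core clock (assumed Amdahl-style model, for illustration only).
 *   clock_bound : fraction of run time (0..1) that scales with clock
 *   old_ghz     : current core clock
 *   new_ghz     : new core clock */
double estimated_speedup(double clock_bound, double old_ghz, double new_ghz)
{
    assert(clock_bound >= 0.0 && clock_bound <= 1.0);
    assert(old_ghz > 0.0 && new_ghz > 0.0);

    double ratio = new_ghz / old_ghz;   /* e.g. 3.2 / 2.5 = 1.28 */

    /* The clock-bound part shrinks by 'ratio'; the rest stays the same. */
    return 1.0 / ((1.0 - clock_bound) + clock_bound / ratio);
}
```

With `clock_bound = 1.0` this gives the full 1.28; with an assumed `clock_bound = 0.5` it gives only about 1.12. The real fraction for a given simulation has to be measured.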

OTOH, an 800 MHz FSB seems to generally cheer up the P4 quite a bit, so that
could be in its favor. It depends on the code, though. Unfortunately, fp-heavy
benchmarks like 3D rendering show zero improvement from the 800 MHz FSB.
Sorry.

But digging deeper into this fp math might be a good idea.
What _kind_ of fp math is it?
Is it compiled to old-fashioned '387 operations?
Or is it autovectorized/optimized for SSE2?
Is it double precision or single?
Does it contain division, and if so, how much?
Are there a lot of conditional instructions or branches?
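
If you have the source, one rough way to start answering the '387-versus-SSE2 question is to ask the compiler itself. The sketch below is an illustration, not a definitive test: `FLT_EVAL_METHOD` is a standard C99 macro from `<float.h>`, while `__SSE2__` is a predefined macro in GCC and Clang (other compilers may use different macros). The function names are made up for this example.

```c
#include <float.h>

/* Describe how this translation unit evaluates floating point,
 * based on the C99 FLT_EVAL_METHOD macro. */
const char *fp_eval_description(void)
{
#if defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 2
    return "x87-style: intermediates kept in long double precision";
#elif defined(FLT_EVAL_METHOD) && FLT_EVAL_METHOD == 0
    return "each operation evaluated in its own type (typical of SSE/SSE2 code)";
#else
    return "other or indeterminate evaluation method";
#endif
}

/* Returns 1 if the compiler was allowed to emit SSE2 instructions
 * for this file (per the GCC/Clang __SSE2__ macro), else 0. */
int built_with_sse2(void)
{
#if defined(__SSE2__)
    return 1;
#else
    return 0;
#endif
}
```

This only tells you what the compiler was permitted to do for that file. Whether the hot loops were actually vectorized still has to be checked in the compiler's report or the generated assembly.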

The P4 is pretty much a wimp on everything fp, except vectorized,
straightforward mul, add, sub using SSE2.
A lot of time-consuming work like matrix/tensor multiplications,
transformations etc. does fall into that category, though. So it might
be a good idea to see to it that the code is compiled with Intel's
auto-vectorizing optimizing compiler.
If everything is optimal, you can get 3-3.5 times the performance on
single precision fp (this is the kind of performance you see in P4
video encoding). On the other hand, branches and division ruin it all.
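
To make the distinction concrete, here is a small sketch in C of the two kinds of loop. The function names and the exact operations are hypothetical, chosen only to illustrate the pattern. Whether a particular compiler vectorizes the first loop depends on the compiler and its options (with GCC, for example, something like `-O3 -msse2`), so treat this as a shape to look for, not a guarantee.

```c
#include <stddef.h>

/* Straightforward multiply-add over arrays: no branches, no division.
 * 'restrict' (C99) promises the arrays do not overlap, which helps a
 * vectorizing compiler prove that the iterations are independent. */
void mul_add(float *restrict out, const float *restrict a,
             const float *restrict b, float scale, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * scale + b[i];
}

/* Same kind of data, but with a data-dependent branch and one division
 * per element: the pattern that tends to hurt on the P4. */
void branchy_div(float *restrict out, const float *restrict a,
                 const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (b[i] != 0.0f)
            out[i] = a[i] / b[i];
        else
            out[i] = 0.0f;
    }
}
```

If the simulation's inner loops look like `mul_add`, the vectorizing compiler has something to work with. If they look like `branchy_div`, the flattering SSE2 benchmark numbers probably won't carry over.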

Just reading a single benchmark is not going to be of any use to
you. P4/Xeon benchmarks tend to be 100% SSE2, outrageously optimized
and highly flattering for Intel. Real applications might be a
different thing (scalar '387?). Unless you know what the code looks
like and how it is compiled, you cannot be sure of getting the
performance that common benchmarks imply.
If you write the software yourself, Intel's compiler is a free
download. Try it if you haven't already.
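
Since published benchmarks may not match your code, timing your own inner loop under each compiler and on each machine is more telling. A minimal timing sketch in C follows, using the standard `clock()` from `<time.h>`. Note that `sample_kernel` is a hypothetical stand-in for the real simulation loop, and `time_kernel` is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the simulation's inner loop:
 * a partial harmonic sum, with one division per iteration. */
double sample_kernel(size_t n)
{
    double sum = 0.0;
    for (size_t i = 1; i <= n; ++i)
        sum += 1.0 / (double)i;
    return sum;
}

/* Run kernel(n) 'repeats' times and return the CPU seconds used.
 * The accumulated results go to *result (if non-NULL) so the compiler
 * cannot discard the work as unused. */
double time_kernel(double (*kernel)(size_t), size_t n, int repeats,
                   double *result)
{
    assert(kernel != NULL && repeats > 0);

    double acc = 0.0;
    clock_t start = clock();
    for (int i = 0; i < repeats; ++i)
        acc += kernel(n);
    clock_t end = clock();

    if (result != NULL)
        *result = acc;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Build the same file with each compiler and option set you want to compare, run it with a problem size close to the real workload, and compare the reported seconds. `clock()` has coarse resolution, so use enough repeats that the run lasts at least a few seconds.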

The other suggestion is to try AMD instead. All AMD families
(AthlonXP, Athlon64, Opteron) are brutes at scalar '387 math. They
also handle branches, division and underflow/overflow better than the P4.
Try borrowing an AthlonXP and see if the code suits it better.

Ancra