Fastest floating point operation possible

#1 January 29th 04, 05:39 PM

Hi,
I have a machine with a Pentium 4 2.52 GHz processor and 1 GB of Rambus
memory. I think the bus speed is 500 MHz (or thereabouts). The machine is
about 1.5 years old.

The question I have is this:

Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math). I am wondering whether going to
the newest Intel processor (Pentium 4 Extreme with a 3.2 GHz clock and
800 MHz bus) will give me a significant increase in speed beyond the sheer
clock speed increase. That is, will the speed improvement only be
3.2 GHz / 2.5 GHz = 1.28 (a 28% speed increase), or are the architecture and
bus speed going to give me much more performance than I currently have?

Thanks,

Paul

#2 January 29th 04, 06:11 PM

The SPECfp results site seems to indicate that the fastest machine, in
some cases, is an AMD Opteron on an ASUS SK8N motherboard. There's lots
of info there.

http://www.specbench.org/cpu2000/results/res2003q4/

--
Al Dykes
-----------

#3 January 29th 04, 10:27 PM

I seem to recall (from browsing the site recently) that one of the AMD
chips is especially good at floating point ops.

Lurker

#4 January 30th 04, 12:40 AM

CPU performance doesn't scale linearly with clock rate, so no: everything
else being the same, you'll get less than a 28% increase.
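
One rough way to see why the gain usually stays below the clock ratio:
only the part of the run time that is bound by the core speeds up with
the clock, while time spent waiting on memory does not. A minimal sketch,
assuming a simple two-part model (the function name and the 70% and 40%
fractions are invented for illustration, not measured):

```python
def overall_speedup(core_fraction, clock_ratio):
    """Amdahl-style estimate of the overall speedup.

    core_fraction: share of the run time that scales with CPU clock (0..1).
    clock_ratio:   new clock divided by old clock.
    The remaining (1 - core_fraction) is assumed not to speed up at all.
    """
    if not 0.0 <= core_fraction <= 1.0:
        raise ValueError("core_fraction must be between 0 and 1")
    if clock_ratio <= 0:
        raise ValueError("clock_ratio must be positive")
    return 1.0 / ((1.0 - core_fraction) + core_fraction / clock_ratio)


# Clock ratio from the original question: 3.2 GHz vs. 2.5 GHz.
clock_ratio = 3.2 / 2.5

# Illustrative fractions only; the real value depends on the workload.
for core_fraction in (1.0, 0.7, 0.4):
    gain = overall_speedup(core_fraction, clock_ratio)
    print(f"core-bound share {core_fraction:.0%}: overall speedup {gain:.3f}x")
```

The full 1.28x only appears when the whole run is core-bound; any time
spent stalled on memory pulls the figure down.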

OTOH, the 800 MHz FSB seems to generally cheer up the P4 quite a bit, so
that could work in your favor. It depends on the code though.
Unfortunately, fp-ish benchmarks like 3D rendering show zero improvement
from the 800 MHz FSB. Sorry.

But digging deeper into this fp math might be a good idea.
What _kind_ of fp math is it?
Is it compiled to old fashioned '387 operations?
Or is it autovectorized/optimized for SSE2?
Is it double precision or single?
Does it contain division, and how much?
Are there a lot of conditional instructions, branches?

The P4 is pretty much a wimp on everything fp, except vectorized,
straightforward mul, add, sub using SSE2.
A lot of time-consuming work like matrix/tensor multiplications,
transformations etc. does fall into that category though. So it might
be a good idea to see to it that the code is compiled with Intel's
auto-vectorizing optimizing compiler.
If everything is optimal, you can get 3-3.5 times the performance on
single precision fp (this is the kind of performance you see in P4
video encoding). On the other hand, branches and division ruin it all.
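
The ceiling on that gain follows from the register width: an SSE2
register is 128 bits wide, so one packed instruction works on four
single precision or two double precision values. A small sketch of that
arithmetic (the function name is made up here; real speedups land below
the lane count, as the 3-3.5 times figure above suggests):

```python
SSE2_REGISTER_BITS = 128  # width of one SSE2 (XMM) register


def simd_lanes(element_bits, register_bits=SSE2_REGISTER_BITS):
    """Number of values of the given width that fit in one SIMD register."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


# The lane count is the upper bound on the gain from packing alone.
for name, bits in (("single", 32), ("double", 64)):
    print(f"{name} precision: {simd_lanes(bits)} values per SSE2 instruction")
```

So double precision code can gain at most about 2x from SSE2 packing,
against 4x for single precision.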

Just reading a single benchmark is not going to be of any use to
you. P4/Xeon benchmarks tend to be 100% SSE2, outrageously optimized
and highly flattering for Intel. Real applications might be a
different thing (scalar '387?). Unless you know what the code looks
like and how it is compiled, you cannot be sure of getting the
performance that common benchmarks imply.
If you write the software yourself, Intel's compiler is a free
download. Try it if you haven't already.

The other suggestion is to try AMD instead. All AMD families,
AthlonXP, Athlon64, Opteron, are brutish on scalar '387 math. They
also handle branches, division, underflow/overflow better than the P4.
Try borrowing an AthlonXP and see if the code suits it better.

Ancra

#5 January 30th 04, 04:26 AM

Paul Spitalny writes:
Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math).

Depending on whether you have access to the simulation engine code
and whether you want to put in the effort, floating point digital
signal processing chips now routinely provide over 3 gigaflops, if
you can get your code to fit inside the constantly increasing memory
that is inside these parts.

Both Texas Instruments and Analog Devices produce such parts, boards
and development tools, and there is an assortment of companies
mounting these on boards and providing development tools. Some of
these even provide multiple processors per board, if your job is
suited to that and you decide you actually need to go fast.

#6 February 2nd 04, 08:13 PM

Hi Ancra,
Well, I asked the software vendor (the software that I run to do
simulation work with) about the mathematics in their program and this is
what they said:

Q: Does the code use mostly floating point math operations. If so, then:
What _kind_ of floating point math is it?
Is it compiled to old fashioned '387 operations?

A: Yes. We compile a generic version which must be supported by as many
existing x86 processors as possible. As a result, we don't optimize
the code for any particular x86 instruction set extension.

Q: Or is it autovectorized/optimized for SSE2?
A: No.

Q: Is it double precision or single?
A: Double as in original Berkeley Spice 3.

Q: Does it contain division, how much?
A: It's hard to tell. It depends on what you want to simulate with
SmartSpice.

Q: Is there a lot of conditional instructions, branches?
A: Sure.

Q: Is the code (for Windows) compiled with Intel's auto-vectorizing
optimizing compiler?
A: No.

That being the case, I wonder how to proceed. I can't help but think that
the newest "extreme" Pentium (now up to a 3.4 GHz clock and 800 MHz FSB)
has got to be significantly faster than my older 2.5 GHz Pentium 4 (with
RAMBUS memory). The "extreme" processor has 1Meg of L2 cache and you
would think that'd help too.

Or do you feel like the AMD chips might be better, since they are known
for better performance at floating point? You see, the guys I get my
software from, as they mention above, don't compile for specific
processors or to optimize performance.

By the way, thank you for your response to my posting!!

Paul

#7 February 2nd 04, 08:15 PM

Hi Don,
Unfortunately, I don't have access to the source code. But your idea is
an interesting one... I am not sure I have the expertise to pull it off,
though!

Thanks!

Paul

#8 February 2nd 04, 11:42 PM

Paul Spitalny writes:
Unfortunately, I don't have access to the source code. But, your idea is
an interesting one....I am not sure I have the expertise to pull it off
though!

Reading your other posts, I might suggest asking your Spice vendor to
tell you how much improvement you are going to get if you switch to a
different processor. They certainly should know the answer to this,
even if it takes handing over your Spice model to them to run.

And if there is money in the budget you might compare the speed of
the Spice packages available from a few vendors, again perhaps needing
to hand over a copy of your typical model.

#9 February 3rd 04, 03:47 AM

With those answers, it might be worthwhile to try AMD!

Is it just me, or wouldn't it be a simple matter for you to check?
Just borrow an AthlonXP machine and try the software on it. You should
get some definite indication.
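
A minimal sketch of such a check: run the same job on each machine and
compare the best or median wall-clock time. The helper names are invented
for illustration, and the command in the final comment is a placeholder,
not the real invocation of any simulator.

```python
import statistics
import subprocess
import time


def time_runs(run_once, repeats=3):
    """Call run_once() `repeats` times; return each call's wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to the two figures worth comparing."""
    return {"best": min(durations), "median": statistics.median(durations)}


def time_command(argv, repeats=3):
    """Time an external program given as an argument list."""
    return summarize(time_runs(lambda: subprocess.run(argv, check=True), repeats))


# Hypothetical usage; substitute the real command and input file:
# print(time_command(["my_simulator", "typical_deck.cir"]))
```

Use the same input deck and the same software version on both machines,
otherwise the numbers are not comparable.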

The P4 has weaknesses, and some of those are basically everything you
listed... Whatever article you've read, I can almost guarantee you
every single synthetic benchmark is virtually 100% SSE2.

While I can't predict the outcome with any certainty, I think you
should definitely try to see what an AMD CPU makes of it.
So I wouldn't look to either P4, Extreme Edition or Prescott, for a
solution, because my perception is that the Intel architecture might
not cut it.
Sure, it would be an improvement, but as you might gather, I doubt it
will be "fastest". Also, if a P4EE is going to run continuously at
100% for hours, you need a good cooling solution, or it will just
throttle. It's also going to cost an awful lot of money for very
little more performance than a vanilla P4C.
With that kind of money for a PC, I'd start to go crazy and drool over
plans for a Prometheus case, DDR533 and a CPU frozen to -40 °C and
overclocked 30-40%.

If AMD checks out, a machine that would look attractive to me is the
coming Socket 939 Athlon 64 3400+ or 3700+.
Dual channel Athlon 64s look like the perfect science/math
PC workstation to me.

Spice sounds vaguely familiar. Isn't that analog electric circuit
simulation?

ancra

#10 February 3rd 04, 02:46 PM

Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math). I am wondering if by going to
the newest Intel chipset (pentium4 extreme with 3.2GHz clock and 800MHz
bus) whether I'll get a significant increase in speed beyond the sheer
clock speed increase?

I've been watching the answers to this question, because I am in a
somewhat similar situation. I have some VERY floating-point
intensive analysis programs that typically run for several hours
on an Athlon XP2100+. These programs operate upon huge arrays of
data, so I suspect that the choke point in my situation is memory
bandwidth -- I am using an old ABIT KT7A that only supports SDRAM
at 133 MHz.

As for standard 387 vs. SSE vs. SSE2 optimizations, I wrote the
programs myself, so I can compile them to use whatever features
are available on the particular processor that I use (Visual
Studio .NET Pro). Probably 85% of the fp operations are evenly
split among mult, add/sub, and trig functions (sin, cos), while
the other 15% are div or division-like (arctan, sqrt). About 10%
of the instruction streams involve branches. Right now,
everything is double-precision, but it might be possible to use
single-precision; I haven't tried it.

So my variation on the original poster's question is: What
high-speed system would best solve my memory bandwidth problems,
in addition to my processing power problems? How does a DDR400
Athlon compare with an 800 MHz FSB P4? If I make the jump to an
Athlon 64 or P4 Extreme, will their 64-bit data buses offer as
much of an advantage as it would seem?

I'm fairly computer-savvy, but frankly I've lost track of how to
compare memory speeds on the Athlon with those on the Pentium 4.
If somebody could point me to a tutorial, it would be much
appreciated.
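
A rough way to put these on a common scale is peak theoretical
bandwidth: bus clock times transfers per clock times bus width times
number of channels. The sketch below uses nominal clocks and the 64-bit
(8-byte) width of an SDRAM/DDR channel and of the Pentium 4 front-side
bus. Sustained bandwidth in real programs is lower, and the function name
is only for illustration.

```python
def peak_bandwidth_mb_s(clock_mhz, transfers_per_clock, width_bytes=8, channels=1):
    """Peak theoretical bandwidth in MB/s (1 MB = 10**6 bytes).

    clock_mhz:           base bus clock in MHz
    transfers_per_clock: 1 for SDR, 2 for DDR, 4 for a quad-pumped bus
    width_bytes:         bus width in bytes (8 for a 64-bit channel)
    channels:            number of independent memory channels
    """
    return clock_mhz * transfers_per_clock * width_bytes * channels


configs = {
    "PC133 SDRAM, single channel": peak_bandwidth_mb_s(133, 1),
    "DDR400, single channel": peak_bandwidth_mb_s(200, 2),
    "DDR400, dual channel": peak_bandwidth_mb_s(200, 2, channels=2),
    "Pentium 4 800 MHz FSB (200 MHz, quad-pumped)": peak_bandwidth_mb_s(200, 4),
}

for name, mb_s in configs.items():
    print(f"{name}: {mb_s} MB/s")
```

On these numbers an 800 MHz FSB matches dual-channel DDR400 at 6.4 GB/s,
while PC133 SDRAM tops out near 1.06 GB/s. The "64-bit" in Athlon 64
refers to register and address width, not to a wider memory bus.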

Thanks,
GB
