fastest floating point operation as possible

#11 February 3rd 04, 02:51 PM

On Mon, 02 Feb 2004 12:13:51 -0800, Paul Spitalny
wrote:

wrote:

Hi Ancra,
Well, I asked the software vendor (the software that I run to do
simulation work with) about the mathematics in their program and this is
what they said:

Q: Does the code use mostly floating point math operations. If so, then:
What _kind_ of floating point math is it?
Is it compiled to old fashioned '387 operations?

A: Yes. We compile a generic version that must be supported by as many
existing x86 processors as possible. As a result, we don't optimize
the code for any particular x86 instruction set extension.

Q: Or is it autovectorized/optimized for SSE2?
A: No.

Q: Is it double precision or single?
A: Double as in original Berkeley Spice 3.

Q: Does it contain division, how much?
A: It's hard to tell. It depends on what you want to simulate with
SmartSpice.

Q: Is there a lot of conditional instructions, branches?
A: Sure.

Q: Is the code (for Windows) compiled with Intel's auto-vectorizing
optimizing compiler?
A: No.

That being the case, I wonder how to proceed. I can't help but think that
the newest "extreme" Pentium (now up to 3.4GHz clock and 800MHz FSB) has
got to be significantly faster than my older 2.5GHz Pentium 4 (with
RAMBUS memory). The "extreme" processor has 1Meg of L2 cache and you
would think that'd help too.

Or, do you feel like the AMD chips might be better since they are known
for better performance at floating point? You see, the guys I get my
software from, as they mention above, don't compile for specific
processors or to optimize performance.

By the way, thank you for your response to my posting!!

Paul;

If it isn't optimized for the P4, the AMD chips will be noticeably faster. AMD
has a much better general-purpose FPU. I would go that way for the best
floating point performance. I doubt the extra L2 cache would make a
noticeable difference in performance on an FPU-intensive application.

JT

#12 February 3rd 04, 05:05 PM

Greg Berchin writes:
Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math).

I've been watching the answers to this question, because I am in a
somewhat similar situation. I have some VERY floating-point
intensive analysis programs that typically run for several hours
on an Athlon XP2100+. These programs operate upon huge arrays of
data, so I suspect that the choke point in my situation is memory
bandwidth -- I am using an old ABIT KT7A that only supports SDRAM
at 133 MHz.

I've got a 2000 with DDR at 266 Mhz, running XP on an ECS board.
I typically sit in Mathematica having it grind away on things for
hours or days. If you had a fairly reasonable job to run I'd
consider timing it so you could compare what DDR would do for you.
(I'd guess that this would make a fairly small contribution, in
the 10-20% range likely.) Or if it wouldn't take much to code it
in Mathematica, the latest version has been substantially speeded
up with a variety of processor and job specific optimizations.
For the mix of things I do I've often seen 3x gains over the last
version, but I don't know how that would compare to carefully
optimized code generation using other tools.


#13 February 4th 04, 05:31 AM

Don Taylor wrote:

Paul Spitalny writes:

Don Taylor wrote:

Paul Spitalny writes:

Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math).

Depending on whether you have access to the simulation engine code
or not and whether you want to put in the effort or not the floating
point digital signal processing chips now routinely provide over 3
gigaflops/second if you can get your code to fit inside the constantly
increasing memory that is inside these parts.

Unfortunately, I don't have access to the source code. But, your idea is
an interesting one....I am not sure I have the expertise to pull it off
though!

Reading your other posts, I might suggest asking your Spice vendor to
tell you how much improvement you are going to get if you switch to a
different processor. They certainly should know the answer to this,
even if it takes your handing over your spice model to them to run.

And if there is money in the budget you might compare the speed of
the Spice packages available from a few vendors, again perhaps needing
to hand over a copy of your typical model.

Hi Don,
Unfortunately, and to my surprise and dismay, the software vendor has
not tried their code on various platforms to benchmark it. So, they had
no opinion or advice on the best platform to run their code.

And, unfortunately, there's not much money in the budget to buy other
Spice simulators (already have two of them)

So, I may just have to get my hands on an AMD and a new P4 machine to see
which is fastest.

Thanks,

Paul

#14 February 5th 04, 04:35 AM

Paul Spitalny wrote:

Hi,
I have a machine with a pentium4 2.52Ghz processor with 1Gig of Rambus
memory. I think the bus speed is 500Mhz (or thereabouts)? The machine is
about 1.5 years old.

The question I have is this:

Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math). I am wondering if by going to
the newest Intel chipset (pentium4 extreme with 3.2GHz clock and 800MHz
bus) whether I'll get a significant increase in speed beyond the sheer
clock speed increase? That is, will the speed improvement only be
3.2Ghz/2.5GHz = 1.28 (28 % speed increase), or, is the architecture and
bus speed going to give me much more performance than I currently have??
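
The clock-only estimate in the question can be written out as a small calculation. This is only a sketch: it assumes run time scales linearly with clock frequency, which is exactly the assumption the question is probing, and the function name is made up for illustration.

```python
def clock_only_speedup(new_ghz: float, old_ghz: float) -> float:
    """Speedup expected if run time scaled purely with clock frequency."""
    if new_ghz <= 0 or old_ghz <= 0:
        raise ValueError("clock frequencies must be positive")
    return new_ghz / old_ghz


# Figures from the post: 3.2 GHz versus roughly 2.5 GHz.
ratio = clock_only_speedup(3.2, 2.5)
print(f"clock-only speedup: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% faster)")
```

Any measured gain above or below this ratio would then be attributable to architecture, cache, or memory differences rather than clock speed.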

Thanks,

Paul

http://www.microsoft.com/windowsxp/6...ds/upgrade.asp

#15 February 6th 04, 06:54 AM

"Greg Berchin" wrote in message
...
Most of my computer work involves simulations that bring the processor
to its knees (doing floating point math). I am wondering if by going to
the newest Intel chipset (pentium4 extreme with 3.2GHz clock and 800MHz
bus) whether I'll get a significant increase in speed beyond the sheer
clock speed increase?

I've been watching the answers to this question, because I am in a
somewhat similar situation. I have some VERY floating-point
intensive analysis programs that typically run for several hours
on an Athlon XP2100+. These programs operate upon huge arrays of
data, so I suspect that the choke point in my situation is memory
bandwidth -- I am using an old ABIT KT7A that only supports SDRAM
at 133 MHz.

As for standard 387 vs. SSE vs. SSE2 optimizations, I wrote the
programs myself, so I can compile them to use whatever features
are available on the particular processor that I use (Visual
Studio .NET Pro). Probably 85% of the fp operations are evenly
split among mult, add/sub, and trig functions (sin, cos), while
the other 15% are div or division-like (arctan, sqrt). About 10%
of the instruction streams involve branches. Right now,
everything is double-precision, but it might be possible to use
single-precision; I haven't tried it.

So my variation on the original poster's question is: What
high-speed system would best solve my memory bandwidth problems,
in addition to my processing power problems? How does a DDR 400
Athlon compare with an 800 MHz fsb P4? If I make the jump to an
Athlon 64 or P4 Extreme, will their 64-bit data buses offer as
much an advantage as it would seem?

I'm fairly computer-savvy, but frankly I've lost track of how to
compare memory speeds on the Athlon with those on the Pentium 4.
If somebody could point me to a tutorial, it would be much
appreciated.

Thanks,
GB

P4EE has a larger cache, not a wider data bus (compared to vanilla P4).
Athlon 64 has internally built memory controller, with either 64 or 128 bit
data bus (depends on model).

If the dataset on which most of the time-consuming operations are done is
close in size to 1-5MB, then going to a larger-cache variety of CPU can benefit
you more than just increasing memory bandwidth with a similar CPU (cache-size
wise) or changing the system chipset (=memory controller, latency, speed,
bandwidth).

P4s have longer pipelines, wider cache lines, and a higher FSB/memory bus speed.
That's great for streaming, but they suffer a greater penalty whenever there's a
cache miss or a wrongly predicted branch. This is why P4s benefit more from
larger caches (and suffer more performance loss with reduced-cache variants)
than Athlons/Durons.

You can go that route: Change only motherboard + memory, so you literally
open up a major bottleneck. Go with a decent PC3200 or top brand PC2700
(first choice better for future), and a motherboard that will take (out of
the box) Barton CPUs. Benchmark. If you are still not satisfied from
improvement, go and buy whatever model you can afford of Athlon (barton) XP:
2500+ to 3200+. Then it is possible that the bottleneck would be the HD
subsystem, since loading large amounts of data into a fast RAM will task the
slow HD.

The other route worth pursuing is going Athlon64. Get either the 3000+ model
(with somewhat reduced cache) or the 3200+ model, and a K8T600 or nForce3-150
chipset motherboard, and you are futureproof for quite some time.

#16 February 6th 04, 03:07 PM

On Fri, 6 Feb 2004 08:54:49 +0200, "Erez Volach"
wrote:

P4EE has a larger cache, not a wider data bus (compared to vanilla P4).

Um; now I'm really confused. I thought that the P4EE had a 64 bit
data bus, compared with 32 bit on the regular P, P2, P3, P4,
Athlon, and Duron. Is that not correct?

Athlon 64 has internally built memory controller, with either 64 or 128 bit
data bus (depends on model).

I read that the A64 has a 64 bit data bus and a "single data
channel", while the A64FX has a 64 bit data bus and a "double data
channel". I'm not sure what is meant by a "data channel" in this
context, but is that what YOU mean?

http://www.nordichardware.com/review...2GHz/index.php

If the dataset on which most of the time-consuming operations are done is
close in size to 1-5MB, then going to a larger-cache variety of CPU can benefit
you more than just increasing memory bandwidth with a similar CPU (cache-size
wise) or changing the system chipset (=memory controller, latency, speed,
bandwidth).

My data sets are between 256 Kwords and 512 Kwords, where a word
is a 64 bit double precision float. So it looks like I fall
within that 1-5MB range that you mention.
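
The size estimate above can be checked with a few lines. This is a sketch that assumes 1 Kword means 1024 words and that every word is an 8-byte double, as stated in the post; the helper name is illustrative.

```python
BYTES_PER_DOUBLE = 8  # a 64-bit double-precision float


def working_set_mib(kwords: int, bytes_per_word: int = BYTES_PER_DOUBLE) -> float:
    """Size in MiB of a data set given in Kwords (assuming 1 Kword = 1024 words)."""
    return kwords * 1024 * bytes_per_word / (1024 * 1024)


for kwords in (256, 512):
    print(f"{kwords} Kwords -> {working_set_mib(kwords):.0f} MiB")
```

Under those assumptions the data sets come to 2 MiB and 4 MiB, which is inside the 1-5MB range but larger than a 512 KB or 1 MB cache.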

P4s have longer pipelines, wider cache lines, and a higher FSB/memory bus speed.
That's great for streaming, but they suffer a greater penalty whenever there's a
cache miss or a wrongly predicted branch. This is why P4s benefit more from
larger caches (and suffer more performance loss with reduced-cache variants)
than Athlons/Durons.

So, since only about 10% of my operations include branches, it
looks to me like the P4 might be the better choice. Right?

OTOH, I understand that the Athlon has a faster FPU ...

I'm so confused.

You can go that route: Change only motherboard + memory, so you literally
open up a major bottleneck. [...] If you are still not satisfied from
improvement, go and buy whatever model you can afford of Athlon (barton) XP:
2500+ to 3200+.

Yes; interestingly enough, after I posted my message I found
http://www.xbitlabs.com/articles/cpu...4-3200_14.html,
where a standard XP3200+ did remarkably well against the A64FX and
the P4EE in mathematical analysis benchmarks -- exactly the sorts
of things that I am doing.

But if I go with the XP3200+, what do I look for on the
motherboard in terms of "DDR" vs. "dual DDR"? I have looked at
motherboard specs, and "dual DDR" capability seldom seems to be
mentioned. Is it even a concern in my situation?

Then it is possible that the bottleneck would be the HD
subsystem, since loading large amounts of data into a fast RAM will task the
slow HD.

Actually, mine is a streaming application. When the program is
running, the hard drive spins down due to inactivity!

Many thanks for your comments.

GB

#17 February 7th 04, 12:29 AM

On Fri, 06 Feb 2004 15:07:44 GMT, Greg Berchin
wrote:

On Fri, 6 Feb 2004 08:54:49 +0200, "Erez Volach"
wrote:

P4EE has a larger cache, not a wider data bus (compared to vanilla P4).

Um; now I'm really confused. I thought that the P4EE had a 64 bit
data bus, compared with 32 bit on the regular P, P2, P3, P4,
Athlon, and Duron. Is that not correct?

Nope!
All cpus since the original Pentium have 64-bit data bus. I think what
you're looking for is dual channel or 128 bit buses.
Socket 939 and 940 Athlon64s have 128bit bus

No P4 has dual channel memory bus. But some mobos/chipsets have
dualchannel bus. But the bus from memory controller (Northbridge) to
cpu fsb is 64 bit. Same thing with AthlonXPs. But P4C at 800FSB can
use the dual channel better than AthlonXPs. P4EE is nothing but a P4
with a large L3 cache. Not a L2 cache! So it doesn't benefit quite
that much from it.

Provided you use a dual channel DDR400 mobo, and not more than two
memory sticks, the P4C's memory bandwidth is much better. L2 cache
latency is also much better on the P4. That's the easy part of the
answer. Unfortunately, the P4 often seems to have problems translating
those advantages into better real world performance.
As long as it's sequential huge blocks of data that is moved about, or
done fairly simple operations on, the P4 does very well with its
memory bandwidth.
But I can't answer your question regarding DDR400 AthlonXP vs 800MHz
P4. The Athlons have lower bandwidth, but also very big L1 cache and
vastly superior branch handling and out of order execution.
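
For comparing the memory setups mentioned in this thread, theoretical peak bandwidth is simply transfers per second times bus width times number of channels. The sketch below assumes a 64-bit (8-byte) bus and treats the "MHz" labels as millions of transfers per second; it gives peak figures only, which real applications do not reach, and the function name is made up for illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s, bus_bits=64, channels=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


configs = {
    "SDRAM at 133 MHz": peak_bandwidth_gb_s(133),
    "DDR400, single channel": peak_bandwidth_gb_s(400),
    "DDR400, dual channel": peak_bandwidth_gb_s(400, channels=2),
    "800 MHz FSB": peak_bandwidth_gb_s(800),
}
for name, gb_s in configs.items():
    print(f"{name}: {gb_s:.2f} GB/s")
```

On paper, dual channel DDR400 matches an 800 MHz FSB, while a single DDR400 channel delivers half of that. Whether the CPU can use the difference is a separate matter.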

The Athlon64's memory latency, in turn, is much superior. Memory
bandwidth of the socket 939 and 940 AMD x86-64 CPUs should also be
better.

Some additional information: AMD Opteron, Athlon64 and AthlonFX are
64-bit CPUs. all other are 32-bit. The significance of these bits, are
the address width of the cpu instructions, not any width of data. Plus
that the 64-bit instructions are extended to use more registers, and
in a more rational manner. In all discussions and benchmarks, so far,
these Athlon64s are treated and used as 32-bit cpus, using 32-bit OS
and 32-bit software. Even so, they still kick ass. With 64-bit
software, they should really start to look interesting.

Athlon 64 has internally built memory controller, with either 64 or 128 bit
data bus (depends on model).

I read that the A64 has a 64 bit data bus and a "single data
channel", while the A64FX has a 64 bit data bus and a "double data
channel". I'm not sure what is meant by a "data channel" in this
context, but is that what YOU mean?

http://www.nordichardware.com/review...2GHz/index.php

If the dataset on which most of the time-consuming operations are done is
close in size to 1-5MB, then going to a larger-cache variety of CPU can benefit
you more than just increasing memory bandwidth with a similar CPU (cache-size
wise) or changing the system chipset (=memory controller, latency, speed,
bandwidth).

My data sets are between 256 Kwords and 512 Kwords, where a word
is a 64 bit double precision float. So it looks like I fall
within that 1-5MB range that you mention.

P4s have longer pipelines, wider cache lines, and a higher FSB/memory bus speed.
That's great for streaming, but they suffer a greater penalty whenever there's a
cache miss or a wrongly predicted branch. This is why P4s benefit more from
larger caches (and suffer more performance loss with reduced-cache variants)
than Athlons/Durons.

So, since only about 10% of my operations include branches, it
looks to me like the P4 might be the better choice. Right?

OTOH, I understand that the Athlon has a faster FPU ...

I'm so confused.

Well, from what I've seen, 7% div is enough to break the P4. Even
using vectorized SSE2 optimization, the AthlonXP sails past even using
old '387 code.

AMD and Intel (post PentiumIII) architectures are wildly different. It
seems to me, extremely hard to make comparisons, that are valid in
correlation to real application performance. I've also come to
realize, that most (all?) synthetic benchmarks are useless as well.
Bottom line is, run the application and see. Some general big guesses
can be made, and is what I've tried to make, in this thread.
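
"Run the application and see" can be as simple as timing the same workload on each machine and dividing the results. The sketch below is one way to do that, not a full benchmarking method: `sample_workload` is a hypothetical stand-in for the real program, and the best of several runs is used to reduce noise from other processes.

```python
import time


def time_workload(workload, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        best = min(best, time.perf_counter() - start)
    return best


def relative_speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def sample_workload():
    # Stand-in for the real application: a small floating-point loop.
    total = 0.0
    for i in range(1, 100_000):
        total += (i * 1.5) / (i + 0.5)
    return total


elapsed = time_workload(sample_workload)
print(f"best of 3: {elapsed:.4f} s")
```

Running it on two machines and passing both times to `relative_speedup` gives a figure that reflects the actual instruction mix, which a synthetic benchmark does not.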

Much can be done with optimization for the P4. But my take is that the
Northwood/Prescott cores are better geared for media than
math/science/engineering. Sure, a lot of things are just matrix mul,
and P4s can be made to do that blazing fast. So if your code spends
most of the time doing things like that, SSE2 should make a hell of a
difference.

But the AMDs don't have any weak spots. They just crunch away, when
a P4 grinds to a halt.
All benchmarks are optimized for the P4. But only mainstream
applications seem to be.
I've had two disappointing P4 experiences, and I think I'm firmly in
the AMD camp now. I recommend you not to invest any money in any P4
system, before trying out your software on one.

You can go that route: Change only motherboard + memory, so you literally
open up a major bottleneck. [...] If you are still not satisfied from
improvement, go and buy whatever model you can afford of Athlon (barton) XP:
2500+ to 3200+.

Yes; interestingly enough, after I posted my message I found
http://www.xbitlabs.com/articles/cpu...4-3200_14.html,
where a standard XP3200+ did remarkably well against the A64FX and
the P4EE in mathematical analysis benchmarks -- exactly the sorts
of things that I am doing.

It partly depends on the code. The AthlonXP does indeed have the most
powerful '387 FPU in existence. Even more powerful than the K8s'.
But K8s' (Opteron, Athlon64, AthlonFX) vector math unit is even more
powerful, even on scalar FP. So the AMD game plan is that even scalar
math should be compiled for that instead.
Intel's P4 plan is similar, even scalar ops are redirected to SSE2 by
their compiler. But the P4 doesn't shine on scalar FP.

Also the AthlonXP can also do better than '387 for vectorized
operations.
You have the interesting possibility of optimizing your code for
'enhanced 3DNow'. I don't know how to do that, I'm lazy and use old
and cheap tools. But check AMD's web site for developer information.
This 'enhanced 3Dnow' supposedly comes to like 80% of P4's SSE2 max
performance, but is supposed to not have the same sensitivity to
fp-mix and branches.
Even though the AthlonXP supports SSE, enhanced 3DNow should be
better. SSE makes the Athlon look better on PIII optimized code, but
isn't the optimum.
(I think some big corp, Lockheed or Boeing, built a supercomputer from
AthlonXPs, for the sole purpose of using enhanced 3DNow for
aerodynamic calculations.)

But if I go with the XP3200+, what do I look for on the
motherboard in terms of "DDR" vs. "dual DDR"? I have looked at
motherboard specs, and "dual DDR" capability seldom seems to be
mentioned. Is it even a concern in my situation?

Dual channel actually is slightly, slightly faster, even on the
AthlonXPs. But they don't make the same use of it, as 800MHz fsb P4s.
It is often regarded as insignificant (for Athlons), particularly in
comparison with later single channel chipsets, like KT600.

Then it is possible that the bottleneck would be the HD
subsystem, since loading large amounts of data into a fast RAM will task the
slow HD.

Actually, mine is a streaming application. When the program is
running, the hard drive spins down due to inactivity!

Either I or you are confused here, because that is not what I
understand by 'streaming'. I think what is meant by 'streaming', is
that input comes directly from output of preceding op. In the case of
P4, I interpret it as generalized to - when you have 'next input and
op ready at hand'. Basically, that there's no conditional statements,
and that everything to be done, for very large continuous segments of
processing, is fixed, and data is continuous. Like
moving/factoring/adding/transforming large data blocks.

P.S.
There have been repeated references to P4 Extreme here. I want to warn
against the P4EE (extreme edition). It costs around $1000, and while
it does do 15% better on some, it doesn't average more than 3% better
on benchmarks, than a vanilla 3.2P4C. (I guess that'll be something
like 1% on actual applications...). In my mind, if you're that
desperate, it's much more tempting to spend all that money on
cpu-freezing and serious overclocking. Sole reason for the P4EE
existence, is an Intel marketing plan to confuse the market about
AMD's Athlon64.

There is also the P4E. Don't confuse them. This is the new 'Prescott'
core. Unfortunately, it's something like 4-9% slower than P4C per
clockrate. It's engineered for higher clockrates, but it's even more
inefficient than the Northwood. The P4 of choice, IMO, and I'm much
surer of that than anything else, remains the P4C for now. Even more
so with price cuts. It may all have changed when we reach 4GHz, but
early Prescott buyers are suckers.

Final words: If I'd dared recommend anything at all, it would probably
be the new Athlon64s. If memory speed is important, socket 939
(currently still unavailable), otherwise socket 754 seem to be doing
well enough.

ancra

#18 February 7th 04, 01:06 AM

On Sat, 07 Feb 2004 01:29:56 +0100, in wrote:

Nope!
All cpus since the original Pentium have 64-bit data bus. I think what
you're looking for is dual channel or 128 bit buses.
Socket 939 and 940 Athlon64s have 128bit bus

I guess I was just mistaken on that. Thanks for straightening me
out.

No P4 has dual channel memory bus. But some mobos/chipsets have
dualchannel bus.

You know what? The whole thing sounds like a massive kludge!
Isn't anything straightforward?

As long as it's sequential huge blocks of data that is moved about, or
done fairly simple operations on, the P4 does very well with its
memory bandwidth.

Well, the data blocks are sequential, but the operations are far
from simple!

But I can't answer your question regarding DDR400 AthlonXP vs 800MHz
P4. The Athlons have lower bandwidth, but also very big L1 cache and
vastly superior branch handling and out of order execution.

It would seem that the only way to find out which is better for my
application is to try them both and see which wins.
Unfortunately, that's exactly the scenario I'm trying to avoid.

Some additional information: AMD Opteron, Athlon64 and AthlonFX are
64-bit CPUs. all other are 32-bit. The significance of these bits, are
the address width of the cpu instructions, not any width of data.

Now I understand. Given that I thought that the data buses in
previous models were 32 bit, the step up to 64 bits seemed to be
quite profound. But now I see that it's not such a big deal.

Well, from what I've seen, 7% div is enough to break the P4. Even
using vectorized SSE2 optimization, the AthlonXP sails past even using
old '387 code.

Wow. I've minimized the use of division in my code, but it's just
a significant part of the analysis that I do.

Sure, a lot of things are just matrix mul,
and P4s can be made to do that blazing fast. So if your code spends
most of the time doing things like that, SSE2 should make a hell of a
difference.

Portions of my code can be configured as matrix multiplications,
if need be. Are SSE2 instructions double precision?

It partly depends on the code. The AthlonXP does indeed have the most
powerful '387 FPU in existence. Even more powerful than the K8s'.

That says a lot. I wonder if newer compilers can optimize for
Athlon XP, or are they still limited to Pentium derivatives?

But K8s' (Opteron, Athlon64, AthlonFX) vector math unit is even more
powerful, even on scalar FP. So the AMD game plan is that even scalar
math should be compiled for that instead.

Okay; I know what a vector is and what a scalar is, in a physics
context. What are they in an FPU context?

You have the interesting possibility of optimizing your code for
'enhanced 3DNow'.

Again, are 3DNow instructions double precision?

Dual channel actually is slightly, slightly faster, even on the
AthlonXPs. But they don't make the same use of it, as 800MHz fsb P4s.
It is often regarded as insignificant (for Athlons), particularly in
comparison with later single channel chipsets, like KT600.

That would explain the lack of emphasis on Dual DDR in Athlon XP
products that I've seen.

Either I or you are confused here, because that is not what I
understand by 'streaming'. I think what is meant by 'streaming', is
that input comes directly from output of preceding op. In the case of
P4, I interpret it as generalized to - when you have 'next input and
op ready at hand'. Basically, that there's no conditional statements,
and that everything to be done, for very large continuous segments of
processing, is fixed, and data is continuous. Like
moving/factoring/adding/transforming large data blocks.

No, what I meant was that my data sets do not reside on the hard
drive, to be read in blocks. (Although right now they do, because
I cannot run fast enough to complete in real time.) In my case,
data will be streaming-in at a constant rate, processed, and
output at the same rate that it comes in. Of course, the
continuous input data are buffered and processed in blocks, but
overall the average output rate has to equal the average input
rate.
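
The real-time requirement described here boils down to one inequality: each block must be processed in no more time than it takes to arrive. A minimal sketch follows, with a block size and sample rate that are hypothetical and not taken from the thread.

```python
def block_period_seconds(block_samples: int, sample_rate_hz: float) -> float:
    """Time it takes one block of input to arrive."""
    return block_samples / sample_rate_hz


def keeps_up(block_samples, sample_rate_hz, processing_seconds):
    """True if a block is processed no slower than it arrives."""
    return processing_seconds <= block_period_seconds(block_samples, sample_rate_hz)


# Hypothetical figures: a 256 Kword block arriving at 48 kHz takes about 5.46 s.
print(keeps_up(262_144, 48_000.0, 4.0))
print(keeps_up(262_144, 48_000.0, 6.0))
```

The measured per-block processing time on a candidate machine can be plugged in directly to see whether it would run in real time.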

There have been repeated references to P4 Extreme here. I want to warn
against the P4EE (extreme edition). [...] Sole reason for the P4EE
existence, is an Intel marketing plan to confuse the market about
AMD's Athlon64.

It worked on me! Thanks for the info.

The P4 of choice, IMO, and I'm much
surer of that than anything else, remains the P4C for now.

Which Pentiums (Pentia?) are of the "C" type?

Final words: If I'd dared recommend anything at all, it would probably
be the new Athlon64s. If memory speed is important, socket 939
(currently still unavailable), otherwise socket 754 seem to be doing
well enough.

Thanks. The A64FX, while expensive, looks very good. Failing
that, I'm leaning toward the XP3200+, or whatever derivative is
available when I have saved up enough of my lunch money to buy
something to replace my XP2100+.

Many thanks,
GB

#19 February 8th 04, 02:58 PM

On Sat, 07 Feb 2004 01:06:55 GMT, Greg Berchin
wrote:

Well, the data blocks are sequential, but the operations are far
from simple!

Well, simple and simple... The Intel compiler's autovectorizing seems
quite capable of coming up with clever and complex tricks. The
important thing is avoiding things like underflow/overflow, division,
evaluations, and branches off into unpredictable territory.

Some additional information: AMD Opteron, Athlon64 and AthlonFX are
64-bit CPUs. all other are 32-bit. The significance of these bits, are
the address width of the cpu instructions, not any width of data.

Now I understand. Given that I thought that the data buses in
previous models were 32 bit, the step up to 64 bits seemed to be
quite profound. But now I see that it's not such a big deal.

- Oh? - It's one hell of a big deal!! But not as long as you're still
running 32-bit software.

Well, from what I've seen, 7% div is enough to break the P4. Even
using vectorized SSE2 optimization, the AthlonXP sails past even using
old '387 code.

Wow. I've minimized the use of division in my code, but it's just
a significant part of the analysis that I do.

Still, the important thing is not how much division the source
contains, but how often it will execute. And to what degree it will
interfere with large timeconsuming operations. I'd still not rule out
a good improvement from SSE2. Particularly if you can, sort of,
isolate divisions. Doing as much as possible of the rest in a 'clean'
context.
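
One possible reading of "isolating divisions" is splitting a loop so that the multiply/add work runs in a division-free pass and the divisions run in a pass of their own. The sketch below shows only that restructuring. The function names and the particular expression are invented for illustration, and Python itself gains nothing from the split; the point is the shape a vectorizing compiler would be given.

```python
def mixed_loop(a, b, c):
    """Original shape: multiply-add and division interleaved per element."""
    return [(x * y + 1.0) / z for x, y, z in zip(a, b, c)]


def split_loop(a, b, c):
    """Same arithmetic, with the division isolated in its own pass."""
    numerators = [x * y + 1.0 for x, y in zip(a, b)]  # division-free pass
    return [n / z for n, z in zip(numerators, c)]     # division-only pass


a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
c = [2.0, 4.0, 8.0]
print(mixed_loop(a, b, c) == split_loop(a, b, c))
```

Because each element goes through the same operations in the same order, the two versions produce identical results.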

Sure, a lot of things are just matrix mul,
and P4s can be made to do that blazing fast. So if your code spends
most of the time doing things like that, SSE2 should make a hell of a
difference.

Portions of my code can be configured as matrix multiplications,
if need be. Are SSE2 instructions double precision?

Yes, it handles DP too, but at half the speed.

It partly depends on the code. The AthlonXP does indeed have the most
powerful '387 FPU in existence. Even more powerful than the K8s'.

That says a lot. I wonder if newer compilers can optimize for
Athlon XP, or are they still limited to Pentium derivatives?

- Aah... '387, that would be unoptimized.

But K8s' (Opteron, Athlon64, AthlonFX) vector math unit is even more
powerful, even on scalar FP. So the AMD game plan is that even scalar
math should be compiled for that instead.

Okay; I know what a vector is and what a scalar is, in a physics
context. What are they in an FPU context?

It's like this: Suppose you have a matrix A, to be multiplied by a
coefficient q. This is an operation that can be vectorized. A's
elements are consecutive. So instead of running
q X Ai = qAi
for all the elements, which is as many operations as the number of
elements in the matrix, it's sent to the SSE2 execution unit instead.
This has 128 bit long registers. And it can perform either 2x64 bit
operations or 4x32 bit fp-operations. So we shovel in the matrix in
128 bit segments and we do
q X [Ai, Aj, Ak, Al] = [qAi, qAj, qAk, qAl]
instead. This example is single precision of course. And we do it in
one fourth of the time, doing only one fourth of the number of
operations. Double precision would look like
q X [Ai, Aj] = [qAi, qAj]
And we don't get quite the same speed advantage.
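
The operation counts in this explanation can be modelled in a few lines. This sketch only counts operations: it does not execute SSE2 instructions, and the function names are illustrative. A 128-bit register is assumed to hold four 32-bit or two 64-bit values, as described above.

```python
def scale_scalar(values, q):
    """Multiply one element per operation; returns (result, op_count)."""
    result = []
    ops = 0
    for v in values:
        result.append(q * v)
        ops += 1
    return result, ops


def scale_packed(values, q, lanes):
    """Multiply `lanes` elements per operation, as a packed register would."""
    result = []
    ops = 0
    for start in range(0, len(values), lanes):
        chunk = values[start:start + lanes]
        result.extend(q * v for v in chunk)  # one packed multiply in hardware
        ops += 1
    return result, ops


REGISTER_BITS = 128
data = [float(i) for i in range(16)]
_, scalar_ops = scale_scalar(data, 2.0)
_, sp_ops = scale_packed(data, 2.0, REGISTER_BITS // 32)  # 4 lanes, single precision
_, dp_ops = scale_packed(data, 2.0, REGISTER_BITS // 64)  # 2 lanes, double precision
print(scalar_ops, sp_ops, dp_ops)
```

For 16 elements that is 16 scalar operations against 4 packed single-precision or 8 packed double-precision operations, which is why double precision sees a smaller gain.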

AMD's K8s have SSE2 too, like the P4s. As I said, the optimizing
compiler sends even scalar math to SSE2 (because its execution unit
is more powerful than the '387's), wasting vector fields.
q X [a, --- ] = [qa, --- ]
For some reason, this scalar SSE2 FP is more powerful on the K8 than
on the P4, even though the P4 is still better on vector operations
(as long as it's 32-bit code). It sounds contradictory, but it has
something to do with the P4 being able to really use its high clock when
data comes in large, flat blocks.
But again, I've seen 64-bit code benchmarks where the K8 gets a
massive boost on 64-SSE2. I can't figure out why, because it's still
the same 128-bit length. There have got to be some clever tricks in how
64-SSE2 uses registers, while 32-SSE2 has to be compatible with Intel's
implementation, of course.

You have the interesting possibility of optimizing your code for
'enhanced 3DNow'.

Again, are 3DNow instructions double precision?

First, '3DNow' and 'enhanced 3DNow' are two different instruction
extensions, just like SSE and SSE2. But no, neither of them handles
DP. Both work on SP floats only.
SSE2 has one execution unit for 128 bit long vectors.
'enhanced 3DNow' has two parallel execution units, each for 64 bit
long vectors.
Both SSE2 and enhanced 3DNow handle a variety of integer fields and
ops, but of the two only SSE2 does DP FP-math.

Which Pentiums (Pentia?) are of the "C" type?

Oh, they tell you if it's 'C', when they sell it.
But basically, Northwood core with 800MHz fsb and HT.

Besides the 800MHz fsb, it also has this feature, Hyper-Threading. This
looks very interesting, as a hardware solution to MS Windows' poor
scheduling/multitasking. Early HT benchmarks were ho-hum. But I've
recently seen for myself that it works wonders for Windows
multitasking response. It even had me a bit excited... :-D.

Final words: If I'd dared recommend anything at all, it would probably
be the new Athlon64s. If memory speed is important, socket 939
(currently still unavailable), otherwise socket 754 seem to be doing
well enough.

Thanks. The A64FX, while expensive, looks very good.

I wasn't recommending the FX. That goes, sort of, into the same poor
value category as the P4EE.
I recommended the as yet unavailable socket 939 Athlon64, and the
available socket 754 Athlon64.

ancra

#20 February 9th 04, 02:00 AM

On Sun, 08 Feb 2004 15:58:03 +0100, in wrote:

But now I see that it's not such a big deal.

- Oh? - It's one hell of a big deal!! But not as long as you're still
running 32-bit software.

My bottleneck seems to be memory bandwidth. If the older
processors had 32 bit data buses, then going to 64 bit would be a
big deal. But since they have 64 bit buses, and the new
processors also have 64 bit buses, it's not such a big deal to me.

Still, the important thing is not how much division the source
contains, but how often it will execute. And to what degree it will
interfere with large timeconsuming operations. I'd still not rule out
a good improvement from SSE2. Particularly if you can, sort of,
isolate divisions. Doing as much as possible of the rest in a 'clean'
context.

Just this morning I rearranged everything so that all "div"
operations were isolated. I got an improvement of just 0.8%. Of
course, this is with standard 387 instructions.

I have a friend with a 1.7 GHz P4 machine, don't know his memory
speed. My XP2100+ machine runs the same application 19% faster
than his (2814 seconds vs. 2792, for the short test that we
tried). Remember that my system doesn't have DDR memory.

Okay; I know what a vector is and what a scalar is, in a physics
context. What are they in an FPU context?

q X [Ai, Aj, Ak, Al] = [qAi, qAj, qAk, qAl]

Got it. Thanks.

Again, are 3DNow instructions double precision?

First, '3DNow' and 'enhanced 3DNow' are two different instruction
extensions, just like SSE and SSE2. But no, neither of them handles
DP. Both work on SP floats only.

I spent the day poking around the AMD and Microsoft sites, so I'm
getting a feel for the 3DNow! instruction sets. AMD includes a
bunch of 3DNow! and MMX sample code with the Microsoft VC++
Processor Pack 5 -- pretty much everything I need in the form of
inline assembly code, callable from C. Once I figure it all out,
I'm going to try it with my application. I think that 32-bit
float will be okay.

Which Pentiums (Pentia?) are of the "C" type?

Oh, they tell you if it's 'C', when they sell it.
But basically, Northwood core with 800MHz fsb and HT.

Okay; that's what I was looking at anyway.

Besides the 800MHz fsb, it also has this feature Hyper Threading.

Useless to me. The machine is dedicated only to this one task.

Thanks. The A64FX, while expensive, looks very good.

I wasn't recommending the FX. That goes, sort of, into the same poor
value category as the P4EE.

Aw, crud. I know at least a LITTLE bit about computers, and I got
fooled. What do people who know nothing do?

I recommended the as yet unavailable socket 939 Athlon64, and the
available socket 754 Athlon64.

I'll keep my eyes open.

Many thanks for your help.
Greg
