#11
On Mon, 02 Feb 2004 12:13:51 -0800, Paul Spitalny wrote:

Hi Ancra,

Well, I asked the software vendor (the software that I run to do simulation work with) about the mathematics in their program, and this is what they said:

Q: Does the code use mostly floating point math operations? If so, what _kind_ of floating point math is it? Is it compiled to old-fashioned '387 operations?
A: Yes. We compile a generic version which must be supported by as many existing x86 processors as possible. As a result we don't optimize the code for a particular x86 instruction set extension.
Q: Or is it autovectorized/optimized for SSE2?
A: No.
Q: Is it double precision or single?
A: Double, as in the original Berkeley Spice 3.
Q: Does it contain division, and how much?
A: It's hard to tell. It depends on what you want to simulate with SmartSpice.
Q: Are there a lot of conditional instructions and branches?
A: Sure.
Q: Is the code (for Windows) compiled with Intel's auto-vectorizing optimizing compiler?
A: No.

That being the case, I wonder how to proceed. I can't help but think that the newest "extreme" Pentium (now up to a 3.4 GHz clock and 800 MHz FSB) has got to be significantly faster than my older 2.5 GHz Pentium 4 (with RAMBUS memory). The "extreme" processor has 1 Meg of L2 cache and you would think that'd help too. Or do you feel like the AMD chips might be better, since they are known for better performance at floating point? You see, the guys I get my software from, as they mention above, don't compile for specific processors or to optimize performance.

By the way, thank you for your response to my posting!!

Paul;

If it isn't optimized for the P4, the AMD chips will be noticeably faster. AMD has a much better general-purpose FPU. I would go that way for the best floating point performance. I doubt the extra L2 cache would make a noticeable difference in performance on an FPU-intensive application.

JT
#12
Greg Berchin writes:
Most of my computer work involves simulations that bring the processor to its knees (doing floating point math).

I've been watching the answers to this question, because I am in a somewhat similar situation. I have some VERY floating-point intensive analysis programs that typically run for several hours on an Athlon XP2100+. These programs operate upon huge arrays of data, so I suspect that the choke point in my situation is memory bandwidth -- I am using an old ABIT KT7A that only supports SDRAM at 133 MHz.

I've got a 2000 with DDR at 266 MHz, running XP on an ECS board. I typically sit in Mathematica having it grind away on things for hours or days. If you had a fairly reasonable job to run, I'd consider timing it so you could compare what DDR would do for you. (I'd guess that this would make a fairly small contribution, likely in the 10-20% range.) Or, if it wouldn't take much to code it in Mathematica, the latest version has been substantially sped up with a variety of processor- and job-specific optimizations. For the mix of things I do I've often seen 3x gains over the last version, but I don't know how that would compare to carefully optimized code generation using other tools.

Email address is valid
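Comparing two machines the way this post proposes comes down to timing the same job on each. A minimal sketch of such a timing harness follows. It is an illustration only: `workload` is a hypothetical stand-in (a loop of sine and cosine products), not anyone's actual analysis code, and taking the best of a few runs is just one reasonable way to reduce noise.

```python
import math
import time


def workload(n=200_000):
    """Hypothetical stand-in for a floating-point-heavy job."""
    total = 0.0
    for i in range(n):
        total += math.sin(i * 0.001) * math.cos(i * 0.002)
    return total


def best_of(func, repeats=3):
    """Return the shortest wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


if __name__ == "__main__":
    print(f"best of 3 runs: {best_of(workload):.3f} s")
```

Running the same script on both machines and dividing the two times gives a speed ratio for this particular stand-in job; the real program may behave differently.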
#13
Don Taylor wrote:

Paul Spitalny writes:

Most of my computer work involves simulations that bring the processor to its knees (doing floating point math).

Depending on whether you have access to the simulation engine code and whether you want to put in the effort, floating point digital signal processing chips now routinely provide over 3 gigaflops, if you can get your code to fit inside the constantly increasing memory that is inside these parts.

Unfortunately, I don't have access to the source code. But your idea is an interesting one.... I am not sure I have the expertise to pull it off, though!

Reading your other posts, I might suggest asking your Spice vendor to tell you how much improvement you are going to get if you switch to a different processor. They certainly should know the answer to this, even if it takes handing over your Spice model for them to run. And if there is money in the budget, you might compare the speed of the Spice packages available from a few vendors, again perhaps needing to hand over a copy of your typical model.

Hi Don,

Unfortunately, and to my surprise and dismay, the software vendor has not tried their code on various platforms to benchmark it. So they had no opinion or advice on the best platform to run their code. And, unfortunately, there's not much money in the budget to buy other Spice simulators (I already have two of them). So I may just have to get my hands on an AMD and a new P4 machine to see which is fastest.

Thanks,
Paul
#14
Paul Spitalny wrote:

Hi,

I have a machine with a Pentium 4 2.52 GHz processor with 1 GB of Rambus memory. I think the bus speed is 500 MHz (or thereabouts)? The machine is about 1.5 years old. The question I have is this: most of my computer work involves simulations that bring the processor to its knees (doing floating point math). I am wondering whether, by going to the newest Intel chipset (Pentium 4 Extreme with a 3.2 GHz clock and 800 MHz bus), I'll get a significant increase in speed beyond the sheer clock speed increase. That is, will the speed improvement only be 3.2 GHz / 2.5 GHz = 1.28 (a 28% speed increase), or are the architecture and bus speed going to give me much more performance than I currently have?

Thanks,
Paul

http://www.microsoft.com/windowsxp/6...ds/upgrade.asp
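The clock-ratio arithmetic in the question can be written out as a small sketch. The second function is an assumption added for illustration: an Amdahl-style estimate in which only a CPU-bound share of the runtime speeds up with the clock. The name `cpu_bound_fraction` and any value given to it are hypothetical, not measured.

```python
def clock_ratio(new_ghz, old_ghz):
    """Ratio of clock speeds, e.g. 3.2 GHz over 2.5 GHz."""
    return new_ghz / old_ghz


def overall_speedup(ratio, cpu_bound_fraction):
    """Amdahl-style estimate: only the CPU-bound share of the runtime
    gets faster by `ratio`; the rest (e.g. waiting on memory) does not."""
    return 1.0 / ((1.0 - cpu_bound_fraction) + cpu_bound_fraction / ratio)


if __name__ == "__main__":
    ratio = clock_ratio(3.2, 2.5)
    for fraction in (1.0, 0.75, 0.5):  # hypothetical CPU-bound shares
        print(fraction, round(overall_speedup(ratio, fraction), 3))
```

This simple model can only show gains up to the clock ratio. Anything beyond that, from a faster bus or a larger cache, would have to be measured on the real workload.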
#15
"Greg Berchin" wrote in message ...

Most of my computer work involves simulations that bring the processor to its knees (doing floating point math). I am wondering whether, by going to the newest Intel chipset (Pentium 4 Extreme with a 3.2 GHz clock and 800 MHz bus), I'll get a significant increase in speed beyond the sheer clock speed increase.

I've been watching the answers to this question, because I am in a somewhat similar situation. I have some VERY floating-point intensive analysis programs that typically run for several hours on an Athlon XP2100+. These programs operate upon huge arrays of data, so I suspect that the choke point in my situation is memory bandwidth -- I am using an old ABIT KT7A that only supports SDRAM at 133 MHz.

As for standard 387 vs. SSE vs. SSE2 optimizations, I wrote the programs myself, so I can compile them to use whatever features are available on the particular processor that I use (Visual Studio .NET Pro). Probably 85% of the fp operations are evenly split among mult, add/sub, and trig functions (sin, cos), while the other 15% are div or division-like (arctan, sqrt). About 10% of the instruction streams involve branches. Right now, everything is double-precision, but it might be possible to use single-precision; I haven't tried it.

So my variation on the original poster's question is: What high-speed system would best solve my memory bandwidth problems, in addition to my processing power problems? How does a DDR 400 Athlon compare with an 800 MHz FSB P4? If I make the jump to an Athlon 64 or P4 Extreme, will their 64-bit data buses offer as much of an advantage as it would seem?

I'm fairly computer-savvy, but frankly I've lost track of how to compare memory speeds on the Athlon with those on the Pentium 4. If somebody could point me to a tutorial, it would be much appreciated.

Thanks,
GB

The P4EE has a larger cache, not a wider data bus (compared to the vanilla P4). The Athlon 64 has a built-in memory controller, with either a 64 or 128 bit data bus (depending on the model).

If the dataset on which most of the time-consuming operations are done is close in size to 1-5MB, then going to a CPU variety with a larger cache can benefit more than just increasing memory bandwidth with a similar CPU (cache-size-wise) or changing the system chipset (= memory controller, latency, speed, bandwidth).

P4s have longer pipelines, wider cache lines, and a higher FSB/memory bus speed. They are great for streaming, but suffer a greater penalty whenever there's a cache miss or a wrongly predicted branch. This is why P4s benefit more from larger caches (and suffer more performance loss with reduced-cache variants) than Athlons/Durons.

You can go this route: change only the motherboard + memory, so you literally open up a major bottleneck. Go with a decent PC3200 or top-brand PC2700 (the first choice is better for the future), and a motherboard that will take Barton CPUs out of the box. Benchmark. If you are still not satisfied with the improvement, go and buy whatever model of Athlon XP (Barton) you can afford: 2500+ to 3200+. Then it is possible that the bottleneck would be the HD subsystem, since loading large amounts of data into fast RAM will tax the slow HD.

The other route worth pursuing is going Athlon64. Get either the 3000+ model (with somewhat reduced cache) or the 3200+ model, and a K8T600 or nForce3-150 chipset motherboard, and you are futureproof for quite some time.
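On the question of how to compare memory speeds across platforms: the usual starting point is theoretical peak bandwidth, which is the transfer rate times the bus width in bytes times the number of channels. The sketch below uses nominal transfer rates (133, 333, 400 and 800 million transfers per second) and a 64-bit bus. These are peak figures only; sustained bandwidth in a real system is lower and depends on the chipset and access pattern.

```python
def peak_bandwidth_mb_s(mega_transfers_per_s, bus_bits=64, channels=1):
    """Theoretical peak bandwidth in MB/s: transfers/s x bytes per transfer x channels."""
    return mega_transfers_per_s * (bus_bits // 8) * channels


CONFIGS = {
    "PC133 SDRAM, single channel": peak_bandwidth_mb_s(133),
    "PC2700 (DDR333), single channel": peak_bandwidth_mb_s(333),
    "PC3200 (DDR400), single channel": peak_bandwidth_mb_s(400),
    "PC3200 (DDR400), dual channel": peak_bandwidth_mb_s(400, channels=2),
    "Front-side bus at 800 MT/s": peak_bandwidth_mb_s(800),
}

if __name__ == "__main__":
    for name, mb_s in CONFIGS.items():
        print(f"{name}: {mb_s} MB/s")
```

On these nominal numbers, dual-channel DDR400 and an 800 MT/s front-side bus have the same peak figure, which is why the two are usually paired.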
#16
On Fri, 6 Feb 2004 08:54:49 +0200, "Erez Volach" wrote:

P4EE has a larger cache, not a wider data bus (compared to vanilla P4).

Um; now I'm really confused. I thought that the P4EE had a 64 bit data bus, compared with 32 bit on the regular P, P2, P3, P4, Athlon, and Duron. Is that not correct?

Athlon 64 has internally built memory controller, with either 64 or 128 bit data bus (depends on model).

I read that the A64 has a 64 bit data bus and a "single data channel", while the A64FX has a 64 bit data bus and a "double data channel". I'm not sure what is meant by a "data channel" in this context, but is that what YOU mean? http://www.nordichardware.com/review...2GHz/index.php

If the dataset on which most of the time consuming operations are done, is close in size to 1-5MB, then going to a larger cache variety CPU can benefit more than just increasing memory bandwidth with a similar CPU (cache size wise) or changing system chipset (=memory controller, latency, speed, bandwidth).

My data sets are between 256 Kwords and 512 Kwords, where a word is a 64 bit double precision float. So it looks like I fall within that 1-5MB range that you mention.

P4's have longer pipelines, wider cache lines, and higher FSB/memory bus. it's great for streaming, but suffer greater penalty whenever there's a cache miss and wrongly predicted branching. this is why P4 benefit more from larger caches (and suffer more performance loss with reduced cache variants) than Athlons/Durons.

So, since only about 10% of my operations include branches, it looks to me like the P4 might be the better choice. Right? OTOH, I understand that the Athlon has a faster FPU ... I'm so confused.

You can go that route: Change only motherboard + memory, so you literally open up a major bottleneck. [...] If you are still not satisfied from improvement, go and buy whatever model you can afford of Athlon (barton) XP: 2500+ to 3200+.

Yes; interestingly enough, after I posted my message I found http://www.xbitlabs.com/articles/cpu...4-3200_14.html, where a standard XP3200+ did remarkably well against the A64FX and the P4EE in mathematical analysis benchmarks -- exactly the sorts of things that I am doing. But if I go with the XP3200+, what do I look for on the motherboard in terms of "DDR" vs. "dual DDR"? I have looked at motherboard specs, and "dual DDR" capability seldom seems to be mentioned. Is it even a concern in my situation?

Then it is possible that the bottleneck would be the HD subsystem, since loading large amounts of data into a fast RAM will task the slow HD.

Actually, mine is a streaming application. When the program is running, the hard drive spins down due to inactivity!

Many thanks for your comments.

GB
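The data set sizes quoted in this post can be converted to bytes to check them against the 1-5MB range. The sketch assumes that "K" means 1024 and that each word is one 8-byte double. The cache sizes in the loop are illustrative values only, not the specification of any particular CPU.

```python
KIB = 1024
MIB = 1024 * KIB


def working_set_bytes(words, bytes_per_word=8):
    """Size in bytes of an array of `words` double-precision values."""
    return words * bytes_per_word


def fits_in_cache(words, cache_bytes, bytes_per_word=8):
    """True if the whole array could sit in a cache of the given size."""
    return working_set_bytes(words, bytes_per_word) <= cache_bytes


if __name__ == "__main__":
    for kwords in (256, 512):
        size = working_set_bytes(kwords * KIB)
        print(f"{kwords} Kwords = {size / MIB:.0f} MiB")
        for cache_kib in (512, 1024, 2048):  # illustrative cache sizes
            print(f"  fits in {cache_kib} KiB cache:",
                  fits_in_cache(kwords * KIB, cache_kib * KIB))
```

Under those assumptions the data sets come to 2 MiB and 4 MiB, inside the 1-5MB range mentioned.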
#17
On Fri, 06 Feb 2004 15:07:44 GMT, Greg Berchin wrote:

On Fri, 6 Feb 2004 08:54:49 +0200, "Erez Volach" wrote:

P4EE has a larger cache, not a wider data bus (compared to vanilla P4).

Um; now I'm really confused. I thought that the P4EE had a 64 bit data bus, compared with 32 bit on the regular P, P2, P3, P4, Athlon, and Duron. Is that not correct?

Nope! All CPUs since the original Pentium have a 64-bit data bus. I think what you're looking for is dual channel or 128-bit buses. Socket 939 and 940 Athlon64s have a 128-bit bus. No P4 has a dual channel memory bus, but some mobos/chipsets have a dual channel bus. The bus from the memory controller (Northbridge) to the CPU, the FSB, is still 64 bit. Same thing with AthlonXPs. But a P4C at 800 FSB can use the dual channel better than AthlonXPs.

The P4EE is nothing but a P4 with a large L3 cache. Not an L2 cache! So it doesn't benefit quite that much from it.

Provided you use a dual channel DDR400 mobo, and not more than two memory sticks, the P4C's memory bandwidth is much better. L2 cache latency is also much better on the P4. That's the easy part of the answer. Unfortunately, the P4 often seems to have problems translating those advantages into better real world performance. As long as it's huge sequential blocks of data that are moved about, or have fairly simple operations done on them, the P4 does very well with its memory bandwidth.

But I can't answer your question regarding DDR400 AthlonXP vs. 800 MHz P4. The Athlons have lower bandwidth, but also a very big L1 cache and vastly superior branch handling and out-of-order execution. The Athlon64's memory latency, in turn, is far superior. Memory bandwidth of the socket 939 and 940 AMD x86-64 CPUs should also be better.

Some additional information: AMD Opteron, Athlon64 and AthlonFX are 64-bit CPUs. All others are 32-bit. The significance of these bits is the address width of the CPU instructions, not any width of data. Plus, the 64-bit instructions are extended to use more registers, and in a more rational manner. In all discussions and benchmarks so far, these Athlon64s are treated and used as 32-bit CPUs, using a 32-bit OS and 32-bit software. Even so, they still kick ass. With 64-bit software, they should really start to look interesting.

Athlon 64 has internally built memory controller, with either 64 or 128 bit data bus (depends on model).

I read that the A64 has a 64 bit data bus and a "single data channel", while the A64FX has a 64 bit data bus and a "double data channel". I'm not sure what is meant by a "data channel" in this context, but is that what YOU mean? http://www.nordichardware.com/review...2GHz/index.php

If the dataset on which most of the time consuming operations are done, is close in size to 1-5MB, then going to a larger cache variety CPU can benefit more than just increasing memory bandwidth with a similar CPU (cache size wise) or changing system chipset (=memory controller, latency, speed, bandwidth).

My data sets are between 256 Kwords and 512 Kwords, where a word is a 64 bit double precision float. So it looks like I fall within that 1-5MB range that you mention.

P4's have longer pipelines, wider cache lines, and higher FSB/memory bus. it's great for streaming, but suffer greater penalty whenever there's a cache miss and wrongly predicted branching. this is why P4 benefit more from larger caches (and suffer more performance loss with reduced cache variants) than Athlons/Durons.

So, since only about 10% of my operations include branches, it looks to me like the P4 might be the better choice. Right? OTOH, I understand that the Athlon has a faster FPU ... I'm so confused.

Well, from what I've seen, 7% div is enough to break the P4. Even when the P4 uses vectorized SSE2 optimization, the AthlonXP sails past it running old '387 code.

AMD and Intel (post Pentium III) architectures are wildly different. It seems to me extremely hard to make comparisons that are valid in correlation to real application performance. I've also come to realize that most (all?) synthetic benchmarks are useless as well. Bottom line is, run the application and see. Some general big guesses can be made, and that is what I've tried to do in this thread.

Much can be done with optimization for the P4. But my take is that the Northwood/Prescott cores are better geared for media than math/science/engineering. Sure, a lot of things are just matrix mul, and P4s can be made to do that blazing fast. So if your code spends most of the time doing things like that, SSE2 should make a hell of a difference. But the AMDs don't have any weak spots. They just crunch away, when a P4 grinds to a halt. All benchmarks are optimized for the P4. But only mainstream applications seem to be.

I've had two disappointing P4 experiences, and I think I'm firmly in the AMD camp now. I recommend that you not invest any money in any P4 system before trying out your software on one.

You can go that route: Change only motherboard + memory, so you literally open up a major bottleneck. [...] If you are still not satisfied from improvement, go and buy whatever model you can afford of Athlon (barton) XP: 2500+ to 3200+.

Yes; interestingly enough, after I posted my message I found http://www.xbitlabs.com/articles/cpu...4-3200_14.html, where a standard XP3200+ did remarkably well against the A64FX and the P4EE in mathematical analysis benchmarks -- exactly the sorts of things that I am doing.

It partly depends on the code. The AthlonXP does indeed have the most powerful '387 FPU in existence. Even more powerful than the K8s'. But the K8s' (Opteron, Athlon64, AthlonFX) vector math unit is even more powerful, even on scalar FP. So the AMD game plan is that even scalar math should be compiled for that instead. Intel's P4 plan is similar: even scalar ops are redirected to SSE2 by their compiler. But the P4 doesn't shine on scalar FP.

Also, the AthlonXP can do better than '387 for vectorized operations. You have the interesting possibility of optimizing your code for 'enhanced 3DNow'. I don't know how to do that; I'm lazy and use old and cheap tools. But check AMD's web site for developer information. This 'enhanced 3DNow' supposedly comes to like 80% of the P4's SSE2 max performance, but is supposed to not have the same sensitivity to fp-mix and branches. Even though the AthlonXP supports SSE, enhanced 3DNow should be better. SSE makes the Athlon look better on PIII-optimized code, but isn't the optimum. (I think some big corp, Lockheed or Boeing, built a supercomputer from AthlonXPs, for the sole purpose of using enhanced 3DNow for aerodynamic calculations.)

But if I go with the XP3200+, what do I look for on the motherboard in terms of "DDR" vs. "dual DDR"? I have looked at motherboard specs, and "dual DDR" capability seldom seems to be mentioned. Is it even a concern in my situation?

Dual channel actually is slightly, slightly faster, even on the AthlonXPs. But they don't make the same use of it as 800 MHz FSB P4s. It is often regarded as insignificant (for Athlons), particularly in comparison with later single channel chipsets, like the KT600.

Then it is possible that the bottleneck would be the HD subsystem, since loading large amounts of data into a fast RAM will task the slow HD.

Actually, mine is a streaming application. When the program is running, the hard drive spins down due to inactivity!

Either I or you are confused here, because that is not what I understand by 'streaming'. I think what is meant by 'streaming' is that input comes directly from the output of the preceding op. In the case of the P4, I interpret it as generalized to: when you have 'next input and op ready at hand'. Basically, that there are no conditional statements, and that everything to be done, for very large continuous segments of processing, is fixed, and the data is continuous. Like moving/factoring/adding/transforming large data blocks.

P.S. There have been repeated references to the P4 Extreme here. I want to warn against the P4EE (Extreme Edition). It costs around $1000, and while it does do 15% better on some benchmarks, it doesn't average more than 3% better than a vanilla 3.2 GHz P4C. (I guess that'll be something like 1% on actual applications...) In my mind, if you're that desperate, it's much more tempting to spend all that money on CPU-freezing and serious overclocking. The sole reason for the P4EE's existence is an Intel marketing plan to confuse the market about AMD's Athlon64.

There is also the P4E. Don't confuse them. This is the new 'Prescott' core. Unfortunately, it's something like 4-9% slower than the P4C per clock. It's engineered for higher clock rates, but it's even more inefficient than the Northwood. The P4 of choice, IMO, and I'm much surer of that than of anything else, remains the P4C for now. Even more so with price cuts. It may all have changed when we reach 4 GHz, but early Prescott buyers are suckers.

Final words: If I dared recommend anything at all, it would probably be the new Athlon64s. If memory speed is important, socket 939 (currently still unavailable); otherwise socket 754 seems to be doing well enough.

ancra
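The "4-9% slower per clock" figure given for the Prescott core can be turned into a break-even clock speed. This is a back-of-the-envelope sketch under one assumption: that performance scales linearly with clock speed, which real workloads only approximate. The 3.2 GHz baseline is the P4C speed mentioned in the post.

```python
def break_even_clock(baseline_ghz, per_clock_deficit):
    """Clock a chip needs in order to match a baseline chip when it does
    `per_clock_deficit` (a fraction, e.g. 0.04) less work per clock.
    Assumes performance scales linearly with clock speed."""
    return baseline_ghz / (1.0 - per_clock_deficit)


if __name__ == "__main__":
    for deficit in (0.04, 0.09):
        ghz = break_even_clock(3.2, deficit)
        print(f"{deficit:.0%} slower per clock: needs about {ghz:.2f} GHz")
```

On that assumption, a chip 4-9% slower per clock would need roughly 3.3 to 3.5 GHz just to draw level with a 3.2 GHz part.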
#18
#19
On Sat, 07 Feb 2004 01:06:55 GMT, Greg Berchin wrote:

Well, the data blocks are sequential, but the operations are far from simple!

Well, simple and simple... The Intel compiler's autovectorizing seems quite capable of coming up with clever and complex tricks. The important thing is avoiding things like underflow/overflow, division, evaluations and branches to Indian country or hell.

Some additional information: AMD Opteron, Athlon64 and AthlonFX are 64-bit CPUs. all other are 32-bit. The significance of these bits, are the address width of the cpu instructions, not any width of data.

Now I understand. Given that I thought that the data buses in previous models were 32 bit, the step up to 64 bits seemed to be quite profound. But now I see that it's not such a big deal.

- Oh? - It's one hell of a big deal!! But not as long as you're still running 32-bit software.

Well, from what I've seen, 7% div is enough to break the P4. Even using vectorized SSE2 optimization, the AthlonXP sails past even using old '387 code.

Wow. I've minimized the use of division in my code, but it's just a significant part of the analysis that I do.

Still, the important thing is not how much division the source contains, but how often it will execute, and to what degree it will interfere with large time-consuming operations. I'd still not rule out a good improvement from SSE2. Particularly if you can, sort of, isolate the divisions, doing as much as possible of the rest in a 'clean' context.

Sure, a lot of things are just matrix mul, and P4s can be made to do that blazing fast. So if your code spends most of the time doing things like that, SSE2 should make a hell of a difference.

Portions of my code can be configured as matrix multiplications, if need be. Are SSE2 instructions double precision?

Yes, it handles DP too, but at half the speed.

It partly depends on the code. The AthlonXP does indeed have the most powerful '387 FPU in existence. Even more powerful than the K8s'.

That says a lot. I wonder if newer compilers can optimize for Athlon XP, or are they still limited to Pentium derivatives?

- Aah... '387, that would be unoptimized.

But K8s' (Opteron, Athlon64, AthlonFX) vector math unit is even more powerful, even on scalar FP. So the AMD game plan is that even scalar math should be compiled for that instead.

Okay; I know what a vector is and what a scalar is, in a physics context. What are they in an FPU context?

It's like this: Suppose you have a matrix A, to be multiplied by a coefficient q. This is an operation that can be vectorized. A's elements are consecutive. So instead of running

q x Ai = qAi

for all the elements, which is as many operations as the number of elements in the matrix, it's sent to the SSE2 execution unit instead. This has 128-bit registers, and it can perform either 2 x 64-bit or 4 x 32-bit fp operations. So we shovel in the matrix in 128-bit segments and we do

q x [Ai, Aj, Ak, Al] = [qAi, qAj, qAk, qAl]

instead. This example is single precision, of course. And we do it in one fourth of the time, doing only one fourth of the number of operations. Double precision would look like

q x [Ai, Aj] = [qAi, qAj]

and we don't get quite the same speed advantage.

AMD's K8s have SSE2 too, like the P4s. As I said, the optimizing compiler sends even scalar math to SSE2 (because its execution unit is more powerful than the '387's), wasting vector fields:

q x [a, ---] = [qa, ---]

For some reason, this scalar SSE2 FP is more powerful on the K8 than on the P4, even though the P4 is still better on vector operations (as long as it's 32-bit code). It sounds contradictory, but it has something to do with the fact that the P4 can really use its high clock when data comes in large, flat blocks. But again, I've seen 64-bit code benchmarks where the K8 does a massive pickup on 64-SSE2. Can't figure out why, because it's still the same 128-bit length. There have got to be some clever tricks in how 64-SSE2 uses registers. While 32-SSE2 has to be compatible with Intel's implementation, of course.

You have the interesting possibility of optimizing your code for 'enhanced 3DNow'.

Again, are 3DNow instructions double precision?

First, '3DNow' and 'enhanced 3DNow' are two different instruction extensions, just like SSE and SSE2. And yes, 'enhanced 3DNow' handles DP. ...And no, old '3DNow' was just SP. SSE2 has one execution unit for 128-bit vectors. 'Enhanced 3DNow' has two parallel execution units, each for 64-bit vectors. Both SSE2 and enhanced 3DNow handle a variety of integer fields and ops, as well as SP and DP FP math.

Which Pentiums (Pentia?) are of the "C" type?

Oh, they tell you if it's 'C' when they sell it. But basically, Northwood core with 800 MHz FSB and HT. Besides the 800 MHz FSB, it also has this Hyper-Threading feature. This looks very interesting, as a hardware solution to MS Windows' poor scheduling/multitasking. Early HT benchmarks were ho-hum. But I've recently seen for myself that it does wonders for Windows multitasking response. It even had me a bit excited... :-D.

Final words: If I'd dared recommend anything at all, it would probably be the new Athlon64s. If memory speed is important, socket 939 (currently still unavailable), otherwise socket 754 seem to be doing well enough.

Thanks. The A64FX, while expensive, looks very good.

I wasn't recommending the FX. That goes, sort of, into the same poor value category as the P4EE. I recommended the as yet unavailable socket 939 Athlon64, and the available socket 754 Athlon64.

ancra
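The scaling example in this post (q times the elements of A, taken four or two at a time) can be modelled by counting operations. The sketch below is plain Python and does not execute SSE2 instructions. It only imitates filling a 128-bit register with 32-bit or 64-bit elements and counts how many packed multiplies would be issued. The function names are made up for illustration.

```python
REGISTER_BITS = 128  # width of one SSE2 register


def lanes(element_bits):
    """How many elements of the given width fit in one 128-bit register."""
    return REGISTER_BITS // element_bits


def scale_packed(values, q, element_bits):
    """Scale `values` by q one register-load at a time.

    Returns (scaled_values, packed_op_count). Each loop iteration stands
    for a single packed multiply on up to lanes(element_bits) elements.
    """
    width = lanes(element_bits)
    scaled = []
    ops = 0
    for start in range(0, len(values), width):
        chunk = values[start:start + width]
        scaled.extend(q * x for x in chunk)
        ops += 1
    return scaled, ops


if __name__ == "__main__":
    a = [float(i) for i in range(8)]
    _, single_ops = scale_packed(a, 2.0, element_bits=32)
    _, double_ops = scale_packed(a, 2.0, element_bits=64)
    print("scalar ops:", len(a))
    print("packed single-precision ops:", single_ops)
    print("packed double-precision ops:", double_ops)
```

For eight elements the count drops from eight scalar multiplies to two packed single-precision or four packed double-precision ones, which is the one-fourth versus one-half relationship described in the post. Operation counts are not run times; actual speed depends on the execution units and memory behaviour discussed in the thread.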
#20