Additional registers in x86-64

#1 August 13th 04, 10:03 AM

A few weeks ago, AMD published the SPECint2000 score for the FX-53:
http://www.spec.org/cpu2000/results/...628-03181.html

SPECint2000_peak = 1700
SPECint2000_base = 1601

I see that they used Intel's compiler on Windows XP Professional. Please
correct me if I am wrong. Windows XP is a 32-bit OS, thus the benchmarks
did not use the 8 additional general purpose registers defined in the
x86-64 instruction set, right?

I imagine that, even with 8 more registers available, gcc cannot
outperform Intel's compiler and Microsoft libraries on integer code?

I also noticed Sun's recent SPECfp2000 submission for the Opteron 150:
http://www.spec.org/cpu2000/results/...712-03241.html

SPECfp2000_peak = 1787
SPECfp2000_base = 1637

Sun did use a 64-bit OS, and it seems they compiled most benchmarks as
64-bit applications. I imagine the compiler (most often PathScale)
produced SIMD code to use the XMM registers?

In short, I am wondering how much improvement the 8 additional GPRs and
8 additional media registers bring...

--
Regards, Grumble

#2 August 14th 04, 09:43 AM

On Fri, 13 Aug 2004 11:03:06 +0200, Grumble wrote:

A few weeks ago, AMD published the SPECint2000 score for the FX-53:
http://www.spec.org/cpu2000/results/...628-03181.html

SPECint2000_peak = 1700
SPECint2000_base = 1601

I see that they used Intel's compiler on Windows XP Professional. Please
correct me if I am wrong. Windows XP is a 32-bit OS, thus the benchmarks
did not use the 8 additional general purpose registers defined in the
x86-64 instruction set, right?

That is correct.

I imagine that, even with 8 more registers available, gcc cannot
outperform Intel's compiler and Microsoft libraries on integer code?

Correct again. The optimizations in GCC are not as good as those in
Intel's compiler, though the difference is generally not huge. Take a
look at the results AMD published for their 'A4800' systems. These
are a bunch of Opteron 144 (1.8GHz) processors running under a variety
of different OSes and using different compilers. The fastest result
they achieved was 1095 using Win2K3 (32-bit OS) + Intel's (32-bit)
compiler. For comparison, with SuSE 8 for AMD64 (64-bit OS) + GCC 3.3
(64-bit) they managed 1045, and with SuSE 8 for x86 (32-bit OS) + GCC
3.3 for x86 (32-bit compiler) they turned in a score of 960.

So, in the end AMD showed an 8.8% improvement by going from 32 to
64-bit code, but they saw a 14% improvement going from Linux + GCC
(32-bit) to Windows + Intel C (also 32-bit).
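
For anyone who wants to check the arithmetic, here is a quick sketch in Python. It uses only the three SPECint2000 scores quoted above; the function and variable names are made up for illustration. The results agree with the percentages above to within rounding.

```python
def percent_gain(new_score, old_score):
    """Relative improvement of new_score over old_score, in percent."""
    return (new_score / old_score - 1.0) * 100.0

# SPECint2000 scores quoted above for the Opteron 144 systems.
gcc_32bit = 960    # SuSE 8 for x86 + GCC 3.3 (32-bit)
gcc_64bit = 1045   # SuSE 8 for AMD64 + GCC 3.3 (64-bit)
icc_32bit = 1095   # Win2K3 + Intel's compiler (32-bit)

gain_64bit = percent_gain(gcc_64bit, gcc_32bit)
gain_icc = percent_gain(icc_32bit, gcc_32bit)

print(f"32-bit GCC -> 64-bit GCC:   {gain_64bit:.1f}%")  # 8.9%
print(f"32-bit GCC -> 32-bit Intel: {gain_icc:.1f}%")    # 14.1%
```

Note that the first comparison changes only the code width (same compiler family, same OS vendor), while the second changes both the compiler and the OS, so the two percentages do not isolate the same variable.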

I also noticed Sun's recent SPECfp2000 submission for the Opteron 150:
http://www.spec.org/cpu2000/results/...712-03241.html

SPECfp2000_peak = 1787
SPECfp2000_base = 1637

Sun did use a 64-bit OS, and it seems they compiled most benchmarks as
64-bit applications. I imagine the compiler (most often PathScale)
produced SIMD code to use the XMM registers?

Presumably yes, it would use SIMD code, the XMM registers and the
extra 8 integer registers (even with FP code you still need some
integer registers).

In short, I am wondering how much improvement the 8 additional GPRs and
8 additional media registers bring...

Usually more than enough to make up for the performance loss you would
expect with 64-bit code. Normally, if all else is equal, 64-bit code
is about 5-10% slower than 32-bit code until you blow your memory
limits, at which point 32-bit code just completely breaks down.
That's why most bi-arch systems, e.g. Sun's Solaris, still use lots of
32-bit applications where they can.

With AMD64 the extra registers have managed to improve the performance
enough that they not only negate this performance loss, but turn it
into a 5-10% performance gain on average. Not bad at all for a fairly
small cost in die space and virtually no changes to the instruction
set. FWIW the reason why AMD only went to 16 registers (still a
pretty low number as compared to most modern processors) is that this
is the most that they could squeeze into the x86 instruction set
without making fairly major changes (they did a pretty damn good job
of this; obviously they actually put some thought into how to extend
x86 to 64 bits as naturally as possible).
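
To illustrate why 16 is the natural ceiling, here is a rough sketch in Python. In the x86 encoding each register field in the ModRM byte is 3 bits wide (8 registers); x86-64 adds an optional REX prefix byte that supplies one extra bit per field, giving 4 bits (16 registers). The sketch hand-encodes only a 64-bit register-to-register MOV using opcode 0x89; the helper name `encode_mov_rr` is hypothetical, and this is not a general assembler.

```python
# Register numbers used by the x86-64 encoding (0-15).
REGS = {name: num for num, name in enumerate(
    ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
     "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"])}

def encode_mov_rr(dst, src):
    """Encode `mov dst, src` (64-bit, register to register) as bytes.

    Uses opcode 0x89 (MOV r/m64, r64): src goes in ModRM.reg,
    dst goes in ModRM.rm.
    """
    d, s = REGS[dst], REGS[src]
    # REX prefix layout is 0100WRXB. W=1 selects 64-bit operand size,
    # R carries the 4th bit of ModRM.reg, B the 4th bit of ModRM.rm.
    rex = 0x48 | ((s >> 3) << 2) | (d >> 3)
    # mod=11 means register-direct; the low 3 bits of each register
    # number go in the original 3-bit fields.
    modrm = 0xC0 | ((s & 7) << 3) | (d & 7)
    return bytes([rex, 0x89, modrm])

print(encode_mov_rr("rbx", "rax").hex())  # 4889c3
print(encode_mov_rr("rax", "r8").hex())   # 4c89c0
print(encode_mov_rr("r15", "rax").hex())  # 4989c7
```

The opcode and ModRM bytes keep their old x86 layout; only the prefix byte changes when one of the new registers r8-r15 is involved. Going beyond 16 registers would have needed more than one spare bit per field.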

-------------
Tony Hill
hilla underscore 20 at yahoo dot ca

#3 August 15th 04, 08:46 AM

Tony Hill wrote:
Correct again. The optimizations in GCC are not as good as those in
Intel's compiler, though the difference is generally not huge. Take a
look at the results AMD published for their 'A4800' systems. These
are a bunch of Opteron 144 (1.8GHz) processors running under a variety
of different OSes and using different compilers. The fastest result
they achieved was 1095 using Win2K3 (32-bit OS) + Intel's (32-bit)
compiler. For comparison, with SuSE 8 for AMD64 (64-bit OS) + GCC 3.3
(64-bit) they managed 1045, and with SuSE 8 for x86 (32-bit OS) + GCC
3.3 for x86 (32-bit compiler) they turned in a score of 960.

So, in the end AMD showed an 8.8% improvement by going from 32 to
64-bit code, but they saw a 14% improvement going from Linux + GCC
(32-bit) to Windows + Intel C (also 32-bit).

Cool, but I wonder why AMD submitted the scores with Intel's 32-bit
compiler and a 32-bit OS, rather than a 64-bit OS with the 64-bit
PathScale or PGI compilers? These two companies seem to have designed
their compilers entirely around AMD64, which I'm completely certain
Intel's compilers aren't.

With AMD64 the extra registers have managed to improve the performance
enough that they not only negate this performance loss, but turn it
into a 5-10% performance gain on average. Not bad at all for a fairly
small cost in die space and virtually no changes to the instruction
set. FWIW the reason why AMD only went to 16 registers (still a
pretty low number as compared to most modern processors) is that this
is the most that they could squeeze into the x86 instruction set
without making fairly major changes (they did a pretty damn good job
of this; obviously they actually put some thought into how to extend
x86 to 64 bits as naturally as possible).

How do we know that the extra performance isn't due to the built-in
memory controller and branch prediction?

Yousuf Khan

#4 August 15th 04, 03:13 PM

Grumble wrote:

In short, I am wondering how much improvement the 8 additional
GPRs and 8 additional media registers bring...

Not too much in Intel's design, I guess:

http://www.anandtech.com/linux/showdoc.aspx?i=2163

Regards.
--
RusH //
http://randki.o2.pl/profil.php?id_r=352019
Like ninjas, true hackers are shrouded in secrecy and mystery.
You may never know -- UNTIL IT'S TOO LATE.

#5 August 16th 04, 03:02 AM

On Sun, 15 Aug 2004 07:46:17 GMT, "Yousuf Khan"
wrote:
Tony Hill wrote:
So, in the end AMD showed an 8.8% improvement by going from 32 to
64-bit code, but they saw a 14% improvement going from Linux + GCC
(32-bit) to Windows + Intel C (also 32-bit).

Cool, but I wonder why AMD submitted the scores with Intel's 32-bit
compiler and a 32-bit OS, rather than a 64-bit OS with the 64-bit
PathScale or PGI compilers? These two companies seem to have designed
their compilers entirely around AMD64, which I'm completely certain
Intel's compilers aren't.

Both of these compilers are still fairly new, and they are still not as
fast as Intel's x86 compilers for integer code. Sun submitted some
SPEC CINT results using the PathScale compiler, and they only managed
a score of 1437/1584 (base/peak) with an Opteron 250, while AMD managed
a score of 1566/1655 with an Opteron 150 using Intel's compiler.
What's more, Sun still had to resort to using GCC for one of their
tests, as it was 20% faster on that test than PathCC.

On the floating point side of things, though, it's a different story.
Sun's Opteron system turns in a VERY respectable 1637/1787
(base/peak) score using a combination of GCC, PGI and PathScale's
compilers. This puts it just about on par with IBM's Power4 chip,
not bad for a processor that sells for about 1/10th the cost.

With AMD64 the extra registers have managed to improve the performance
enough that they not only negate this performance loss, but turn it
into a 5-10% performance gain on average. Not bad at all for a fairly
small cost in die space and virtually no changes to the instruction
set. FWIW the reason why AMD only went to 16 registers (still a
pretty low number as compared to most modern processors) is that this
is the most that they could squeeze into the x86 instruction set
without making fairly major changes (they did a pretty damn good job
of this; obviously they actually put some thought into how to extend
x86 to 64 bits as naturally as possible).

How do we know that the extra performance isn't due to the built-in
memory controller and branch prediction?

Err.. it's not like AMD turns those features off in 32-bit mode on
their Athlon64 and Opteron chips!

-------------
Tony Hill
hilla underscore 20 at yahoo dot ca
