#1
Computation slower with float than double
Hello everybody.

I'm benchmarking a red-black Gauss-Seidel algorithm on a two-dimensional grid of different sizes and element types, and I get a strange result when I change the computation from double to float. Here are the test times for the different grid SIZEs and types:

SIZE     128     256     512
float    2.20s   2.76s   7.86s
double   2.30s   2.47s   2.59s

As you can see, when the grid has a size of 512 nodes the time for the float version increases drastically. The number of iterations is scaled to the grid SIZE (ITERATIONS = 2^28 / SIZE^2), so the total amount of work, and therefore the time, should be similar for every grid size.

Shouldn't float computation always be at least as fast as double? I would like to know whether this is a gcc problem (I don't have another compiler) and, if it is not, what the problem could be.

Hope to receive an answer as soon as possible. Thanks,

Michele Guidolin.

P.S. Here is some more information about the test. The code I'm testing is below; the double version is identical except that the constants are 0.25 instead of 0.25f.

------------- CODE -------------
#define SHIFT_S 9
#define SIZE    (1 << SHIFT_S)
#define DUMP    0
#define MAT(i,j) ((i) << SHIFT_S) + (j)

inline void gs_relax(int i, int j, float *u, float *rhs)
{
    u[MAT(i,j)] = (float)( rhs[MAT(i,j)]         +
                           0.0f  * u[MAT(i,j)]   +
                           0.25f * u[MAT(i+1,j)] +
                           0.25f * u[MAT(i-1,j)] +
                           0.25f * u[MAT(i,j+1)] +
                           0.25f * u[MAT(i,j-1)] );
}

void gs_step_fusion(float *u, float *rhs)
{
    int i, j;

    /* update the red points: */
    for (j = 1; j < SIZE-1; j = j+2) {
        gs_relax(1, j, u, rhs);
    }
    for (i = 2; i < SIZE-1; i++) {
        for (j = 1 + (i+1)%2; j < SIZE-1; j = j+2) {
            gs_relax(i, j, u, rhs);
            gs_relax(i-1, j, u, rhs);
        }
    }
    for (j = 1; j < SIZE-1; j = j+2) {
        gs_relax(SIZE-2, j, u, rhs);
    }
}

int main(void)
{
    int iter;
    int ITERATIONS = (int)(pow(2.0, 28.0) / pow((double)SIZE, 2.0));
    float u[SIZE*SIZE];
    float rhs[SIZE*SIZE];
    double time;

    printf("-----START SEQUENTIAL FUSION------------\n\n");
    printf("size: %d\n", SIZE);
    printf("loops: %d\n", ITERATIONS);

    init_boundaries(u, rhs);

    gettimeofday(&submit_time, 0);
    for (iter = 0; iter < ITERATIONS; iter++)
        gs_step_fusion(u, rhs);
    gettimeofday(&complete_time, 0);

    time = timeval_diff(&submit_time, &complete_time);
    printf("\ntime: %fs\n", time);
    printf("-----END SEQUENTIAL FUSION------------\n\n");

    return 0;
}
---------------CODE--------------

I'm testing this code on this machine:

processor     : 0
vendor_id     : GenuineIntel
cpu family    : 15
model         : 4
model name    : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping      : 1
cpu MHz       : 3192.311
cache size    : 1024 KB
physical id   : 0
siblings      : 2
fdiv_bug      : no
hlt_bug       : no
f00f_bug      : no
coma_bug      : no
fpu           : yes
fpu_exception : yes
cpuid level   : 3
wp            : yes
flags         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor ds_cpl cid
bogomips      : 6324.22

with Hyper-Threading enabled, on GNU/Linux 2.6.8. The compiler is gcc 3.4.4 and the flags are:

CFLAGS = -g -O2 -funroll-loops -msse2 -march=pentium4 -Wall

I also tried -ffast-math and -mfpmath=sse, but I get the same result.
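The code above calls init_boundaries() and timeval_diff() and uses the timevals submit_time and complete_time without showing them. Here is a minimal sketch of what they might look like, just so the benchmark compiles stand-alone together with the code above; the boundary values and the exact helper behaviour are assumptions, not necessarily what the real test used:

------------- CODE -------------
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

static struct timeval submit_time, complete_time;

/* Hypothetical reconstruction: zero the interior and the right-hand
   side, put a nonzero value on the four boundary edges. The real
   initial values were not shown in the original post. */
void init_boundaries(float *u, float *rhs)
{
    int i, j;
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            u[MAT(i,j)] = rhs[MAT(i,j)] = 0.0f;
    for (i = 0; i < SIZE; i++) {
        u[MAT(i,0)] = u[MAT(i,SIZE-1)] = 1.0f;   /* assumed value */
        u[MAT(0,i)] = u[MAT(SIZE-1,i)] = 1.0f;
    }
}

/* Elapsed time between two gettimeofday() samples, in seconds. */
double timeval_diff(struct timeval *start, struct timeval *end)
{
    return (end->tv_sec  - start->tv_sec)
         + (end->tv_usec - start->tv_usec) / 1e6;
}
---------------CODE--------------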
#2
Michele Guidolin <michele dot guidolin at ucd dot ie> writes:

> Shouldn't float computation always be at least as fast as double? I
> would like to know whether this is a gcc problem (I don't have another
> compiler) and, if it is not, what the problem could be.

I wonder if the problem with float being slower might be an alignment issue.

Later

Mark Hittinger
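One way to test the alignment hypothesis — a sketch of my own, not code from the thread — is to allocate the float grids once with guaranteed 16-byte alignment and once deliberately misaligned, then run the same gs_step_fusion loop over both. (The arrays in the original test live on the stack, so this also replaces whatever alignment gcc happens to give them.)

------------- CODE -------------
#include <stdlib.h>

/* posix_memalign is in glibc (may need -D_XOPEN_SOURCE=600).
   *base receives the raw pointer so the caller can free() it. */
static float *alloc_grid(size_t elems, int misalign, void **base)
{
    if (posix_memalign(base, 16, elems * sizeof(float) + 16) != 0)
        return NULL;
    /* Nudge by 4 bytes to deliberately break 16-byte alignment. */
    return (float *)((char *)*base + (misalign ? 4 : 0));
}
---------------CODE--------------

If alignment were the problem, the misaligned run should be slower at every SIZE, not only at 512.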
#3
Seeing as you're using a P4 processor, are you also using SSE2? If so, I've seen discussions in the past where it was shown that the P4's single-precision float doesn't perform nearly as well as its double-precision float. It might have something to do with how it gathers the floating-point operands together before performing the operations. Apparently the AMD implementation of SSE2 shows no performance difference between single and double. It's just one of those weird architectural quirks of the P4.

Yousuf Khan
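A quick way to check whether it is the raw single-precision arithmetic that is slow, rather than something about the grid code, would be a micro-benchmark like the sketch below (my own construction, not from the thread): time the same multiply-add shape in float, then change the type and constant suffixes to double and rebuild with the same flags.

------------- CODE -------------
#include <stdio.h>
#include <sys/time.h>

#define N 100000000

int main(void)
{
    struct timeval t0, t1;
    volatile float sink;   /* keeps the loop from being optimized away */
    float x = 1.0f;        /* change to double for the comparison run */
    int i;

    gettimeofday(&t0, 0);
    for (i = 0; i < N; i++)
        x = x * 0.25f + 0.75f;   /* same multiply-add shape as gs_relax */
    gettimeofday(&t1, 0);
    sink = x;

    printf("%f s\n", (double)(t1.tv_sec - t0.tv_sec)
                   + (t1.tv_usec - t0.tv_usec) / 1e6);
    return 0;
}
---------------CODE--------------

If float and double come out the same here, the slowdown is more likely tied to the data values in the grid than to single-precision throughput as such.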
#4
"Michele Guidolin" "michele dot guidolin at ucd dot ie" wrote in message ... Hello to everybody. I'm doing some benchmark about a red black Gauss Seidel algorithm with 2 dimensional grid of different size and type, I have some strange result when I change the computation from double to float. Here are the time of test with different grid SIZE and type: SIZE 128 256 512 float 2.20s 2.76s 7.86s double 2.30s 2.47s 2.59s As you can see when the grid has a size of 512 node the code with float type increase the time drastically. The number of loops is proportional to the SIZE of grid, so the time should be similar with different SIZE of grid. Should the float computation always fastest than double? I would like to know if is a gcc problem (I don't have other compiler) and if it is not what could be the problem? Hope to receive an answer as soon as possible, Thanks Michele Guidolin. P.S. Here are some more information about the test: The code that I'm testing is the follow and it is the same for the double version (the constant are not 0.25f but 0.25). ------------- CODE ------------- #define SHIFT_S 9 #define SIZE (1SHIFT_S) #define DUMP 0 #define MAT(i,j) ((i)SHIFT_S) + (j) inline void gs_relax(int i,int j,float *u, float *rhs) { u[MAT(i,j)] = (float)( rhs[MAT(i,j)] + 0.0f * u[MAT(i,j)] + 0.25f* u[MAT(i+1,j)]+ 0.25f* u[MAT(i-1,j)]+ 0.25f* u[MAT(i,j+1)]+ 0.25f* u[MAT(i,j-1)]); } look at the assembly code and see if the compiler is converting float to double in the above code. could be that the doubles are being loaded directly into the floating processor stack and the singles are being converted in a gp register then loaded into the fp stack. Recompile with double then look at the assembly code difference. |
#5
Beemer Biker wrote:
> look at the assembly code and see whether the compiler is converting
> float to double in the above code. [snip]

He's using SSE2. Check out his compiler flags.

Yousuf Khan
#6
Beemer Biker wrote:
> look at the assembly code and see whether the compiler is converting
> float to double in the above code. [snip] Recompile with double, then
> look at the difference in the assembly code.

Here is the assembler code of the float version:

------------- ASM -----------------
inline void gs_relax(int i,int j,float *u, float *rhs)
{
  fb:   55                      push   %ebp
        u[MAT(i,j)] = (float)( rhs[MAT(i,j)] +
  fc:   d9 ee                   fldz
  fe:   d9 05 00 00 00 00       flds   0x0
 104:   d9 c9                   fxch   %st(1)
 106:   89 e5                   mov    %esp,%ebp
 108:   56                      push   %esi
 109:   8b 45 08                mov    0x8(%ebp),%eax
 10c:   8b 75 0c                mov    0xc(%ebp),%esi
 10f:   53                      push   %ebx
 110:   c1 e0 09                shl    $0x9,%eax
 113:   8b 4d 10                mov    0x10(%ebp),%ecx
 116:   8b 55 14                mov    0x14(%ebp),%edx
 119:   8d 1c 30                lea    (%eax,%esi,1),%ebx
 11c:   c1 e0 00                shl    $0x0,%eax
 11f:   d8 0c 99                fmuls  (%ecx,%ebx,4)
 122:   d8 04 9a                fadds  (%edx,%ebx,4)
 125:   8d 94 30 00 02 00 00    lea    0x200(%eax,%esi,1),%edx
 12c:   c1 e0 00                shl    $0x0,%eax
 12f:   d9 04 91                flds   (%ecx,%edx,4)
 132:   8d 84 30 00 fe ff ff    lea    0xfffffe00(%eax,%esi,1),%eax
 139:   d8 ca                   fmul   %st(2),%st
 13b:   de c1                   faddp  %st,%st(1)
 13d:   d9 04 81                flds   (%ecx,%eax,4)
 140:   d8 ca                   fmul   %st(2),%st
 142:   de c1                   faddp  %st,%st(1)
 144:   d9 44 99 04             flds   0x4(%ecx,%ebx,4)
 148:   d8 ca                   fmul   %st(2),%st
 14a:   d9 ca                   fxch   %st(2)
 14c:   d8 4c 99 fc             fmuls  0xfffffffc(%ecx,%ebx,4)
 150:   d9 c9                   fxch   %st(1)
 152:   de c2                   faddp  %st,%st(2)
 154:   de c1                   faddp  %st,%st(1)
 156:   d9 1c 99                fstps  (%ecx,%ebx,4)
 159:   5b                      pop    %ebx
 15a:   5e                      pop    %esi
 15b:   5d                      pop    %ebp
 15c:   c3                      ret
------------- ASM -----------------

and here is the assembler code of the double version:

------------- ASM -----------------
inline void gs_relax(int i,int j,double *u, double *rhs)
{
 112:   55                      push   %ebp
        u[MAT(i,j)] = ( rhs[MAT(i,j)] +
 113:   d9 ee                   fldz
 115:   d9 05 00 00 00 00       flds   0x0
 11b:   d9 c9                   fxch   %st(1)
 11d:   89 e5                   mov    %esp,%ebp
 11f:   56                      push   %esi
 120:   8b 45 08                mov    0x8(%ebp),%eax
 123:   8b 75 0c                mov    0xc(%ebp),%esi
 126:   53                      push   %ebx
 127:   c1 e0 09                shl    $0x9,%eax
 12a:   8b 4d 10                mov    0x10(%ebp),%ecx
 12d:   8b 55 14                mov    0x14(%ebp),%edx
 130:   8d 1c 30                lea    (%eax,%esi,1),%ebx
 133:   c1 e0 00                shl    $0x0,%eax
 136:   dc 0c d9                fmull  (%ecx,%ebx,8)
 139:   dc 04 da                faddl  (%edx,%ebx,8)
 13c:   8d 94 30 00 02 00 00    lea    0x200(%eax,%esi,1),%edx
 143:   c1 e0 00                shl    $0x0,%eax
 146:   dd 04 d1                fldl   (%ecx,%edx,8)
 149:   8d 84 30 00 fe ff ff    lea    0xfffffe00(%eax,%esi,1),%eax
 150:   d8 ca                   fmul   %st(2),%st
 152:   de c1                   faddp  %st,%st(1)
 154:   dd 04 c1                fldl   (%ecx,%eax,8)
 157:   d8 ca                   fmul   %st(2),%st
 159:   de c1                   faddp  %st,%st(1)
 15b:   dd 44 d9 08             fldl   0x8(%ecx,%ebx,8)
 15f:   d8 ca                   fmul   %st(2),%st
 161:   d9 ca                   fxch   %st(2)
 163:   dc 4c d9 f8             fmull  0xfffffff8(%ecx,%ebx,8)
 167:   d9 c9                   fxch   %st(1)
 169:   de c2                   faddp  %st,%st(2)
 16b:   de c1                   faddp  %st,%st(1)
 16d:   dd 1c d9                fstpl  (%ecx,%ebx,8)
 170:   5b                      pop    %ebx
 171:   5e                      pop    %esi
 172:   5d                      pop    %ebp
 173:   c3                      ret
------------- ASM -----------------

It's been a long time since I last looked at assembler code, but to me it looks like the float version is doing all of the operations in float. Maybe Yousuf is right and it's the P4 that behaves badly here. Can somebody who knows the asm better than I do help me?

Thanks.

Michele.