#1
Computation slower with float than double
Hello everybody.

I'm benchmarking a red-black Gauss-Seidel algorithm on a two-dimensional grid of different sizes and element types, and I get a strange result when I change the computation from double to float. Here are the test times for the different grid SIZEs and types:

SIZE     128     256     512
float    2.20s   2.76s   7.86s
double   2.30s   2.47s   2.59s

As you can see, when the grid has a size of 512 nodes the time for the float version increases drastically. The number of iterations is scaled to the grid SIZE (ITERATIONS = 2^28 / SIZE^2), so the total amount of work, and therefore the time, should be similar for every grid size.

Shouldn't float computation always be at least as fast as double? I would like to know whether this is a gcc problem (I don't have another compiler) and, if it is not, what the problem could be.

Hope to receive an answer as soon as possible. Thanks,

Michele Guidolin.

P.S. Here is some more information about the test. The code I'm testing is below; the double version is identical except that the constants are 0.25 instead of 0.25f.

------------- CODE -------------
#define SHIFT_S 9
#define SIZE    (1 << SHIFT_S)
#define DUMP    0
#define MAT(i,j) ((i) << SHIFT_S) + (j)

inline void gs_relax(int i, int j, float *u, float *rhs)
{
    u[MAT(i,j)] = (float)( rhs[MAT(i,j)]         +
                           0.0f  * u[MAT(i,j)]   +
                           0.25f * u[MAT(i+1,j)] +
                           0.25f * u[MAT(i-1,j)] +
                           0.25f * u[MAT(i,j+1)] +
                           0.25f * u[MAT(i,j-1)] );
}

void gs_step_fusion(float *u, float *rhs)
{
    int i, j;

    /* update the red points: */
    for (j = 1; j < SIZE-1; j = j+2) {
        gs_relax(1, j, u, rhs);
    }
    for (i = 2; i < SIZE-1; i++) {
        for (j = 1 + (i+1)%2; j < SIZE-1; j = j+2) {
            gs_relax(i, j, u, rhs);
            gs_relax(i-1, j, u, rhs);
        }
    }
    for (j = 1; j < SIZE-1; j = j+2) {
        gs_relax(SIZE-2, j, u, rhs);
    }
}

int main(void)
{
    int iter;
    int ITERATIONS = (int)(pow(2.0, 28.0) / pow((double)SIZE, 2.0));
    float u[SIZE*SIZE];
    float rhs[SIZE*SIZE];
    double time;

    printf("-----START SEQUENTIAL FUSION------------\n\n");
    printf("size: %d\n", SIZE);
    printf("loops: %d\n", ITERATIONS);

    init_boundaries(u, rhs);

    gettimeofday(&submit_time, 0);
    for (iter = 0; iter < ITERATIONS; iter++)
        gs_step_fusion(u, rhs);
    gettimeofday(&complete_time, 0);

    time = timeval_diff(&submit_time, &complete_time);
    printf("\ntime: %fs\n", time);
    printf("-----END SEQUENTIAL FUSION------------\n\n");

    return 0;
}
---------------CODE--------------

I'm testing this code on this machine:

processor     : 0
vendor_id     : GenuineIntel
cpu family    : 15
model         : 4
model name    : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping      : 1
cpu MHz       : 3192.311
cache size    : 1024 KB
physical id   : 0
siblings      : 2
fdiv_bug      : no
hlt_bug       : no
f00f_bug      : no
coma_bug      : no
fpu           : yes
fpu_exception : yes
cpuid level   : 3
wp            : yes
flags         : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe pni monitor ds_cpl cid
bogomips      : 6324.22

with Hyper-Threading enabled, on GNU/Linux 2.6.8. The compiler is gcc 3.4.4 and the flags are:

CFLAGS = -g -O2 -funroll-loops -msse2 -march=pentium4 -Wall

I also tried -ffast-math and -mfpmath=sse, but I get the same result.
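The code above calls init_boundaries() and timeval_diff() and uses the timevals submit_time and complete_time without showing them. Here is a minimal sketch of what they might look like, just so the benchmark compiles stand-alone together with the code above; the boundary values and the exact helper behaviour are assumptions, not necessarily what the real test used:

------------- CODE -------------
#include <stdio.h>
#include <math.h>
#include <sys/time.h>

static struct timeval submit_time, complete_time;

/* Hypothetical reconstruction: zero the interior and the right-hand
   side, put a nonzero value on the four boundary edges. The real
   initial values were not shown in the original post. */
void init_boundaries(float *u, float *rhs)
{
    int i, j;
    for (i = 0; i < SIZE; i++)
        for (j = 0; j < SIZE; j++)
            u[MAT(i,j)] = rhs[MAT(i,j)] = 0.0f;
    for (i = 0; i < SIZE; i++) {
        u[MAT(i,0)] = u[MAT(i,SIZE-1)] = 1.0f;   /* assumed value */
        u[MAT(0,i)] = u[MAT(SIZE-1,i)] = 1.0f;
    }
}

/* Elapsed time between two gettimeofday() samples, in seconds. */
double timeval_diff(struct timeval *start, struct timeval *end)
{
    return (end->tv_sec  - start->tv_sec)
         + (end->tv_usec - start->tv_usec) / 1e6;
}
---------------CODE--------------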
#2
Michele Guidolin <michele dot guidolin at ucd dot ie> writes:

> Shouldn't float computation always be at least as fast as double? I
> would like to know whether this is a gcc problem (I don't have another
> compiler) and, if it is not, what the problem could be.

I wonder if the problem with float being slower might be an alignment issue.

Later

Mark Hittinger
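One way to test the alignment hypothesis — a sketch of my own, not code from the thread — is to allocate the float grids once with guaranteed 16-byte alignment and once deliberately misaligned, then run the same gs_step_fusion loop over both. (The arrays in the original test live on the stack, so this also replaces whatever alignment gcc happens to give them.)

------------- CODE -------------
#include <stdlib.h>

/* posix_memalign is in glibc (may need -D_XOPEN_SOURCE=600).
   *base receives the raw pointer so the caller can free() it. */
static float *alloc_grid(size_t elems, int misalign, void **base)
{
    if (posix_memalign(base, 16, elems * sizeof(float) + 16) != 0)
        return NULL;
    /* Nudge by 4 bytes to deliberately break 16-byte alignment. */
    return (float *)((char *)*base + (misalign ? 4 : 0));
}
---------------CODE--------------

If alignment were the problem, the misaligned run should be slower at every SIZE, not only at 512.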
#3
Seeing as you're using a P4 processor, are you also using SSE2? If so, I've seen discussions in the past where it was shown that the P4's single-precision float doesn't perform nearly as well as its double-precision float. It might have something to do with how it gathers the floating-point operands together before performing the operations. Apparently the AMD implementation of SSE2 shows no performance difference between single and double. It's just one of those weird architectural quirks of the P4.

Yousuf Khan
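A quick way to check whether it is the raw single-precision arithmetic that is slow, rather than something about the grid code, would be a micro-benchmark like the sketch below (my own construction, not from the thread): time the same multiply-add shape in float, then change the type and constant suffixes to double and rebuild with the same flags.

------------- CODE -------------
#include <stdio.h>
#include <sys/time.h>

#define N 100000000

int main(void)
{
    struct timeval t0, t1;
    volatile float sink;   /* keeps the loop from being optimized away */
    float x = 1.0f;        /* change to double for the comparison run */
    int i;

    gettimeofday(&t0, 0);
    for (i = 0; i < N; i++)
        x = x * 0.25f + 0.75f;   /* same multiply-add shape as gs_relax */
    gettimeofday(&t1, 0);
    sink = x;

    printf("%f s\n", (double)(t1.tv_sec - t0.tv_sec)
                   + (t1.tv_usec - t0.tv_usec) / 1e6);
    return 0;
}
---------------CODE--------------

If float and double come out the same here, the slowdown is more likely tied to the data values in the grid than to single-precision throughput as such.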
#4
"Michele Guidolin" "michele dot guidolin at ucd dot ie" wrote in message ... Hello to everybody. I'm doing some benchmark about a red black Gauss Seidel algorithm with 2 dimensional grid of different size and type, I have some strange result when I change the computation from double to float. Here are the time of test with different grid SIZE and type: SIZE 128 256 512 float 2.20s 2.76s 7.86s double 2.30s 2.47s 2.59s As you can see when the grid has a size of 512 node the code with float type increase the time drastically. The number of loops is proportional to the SIZE of grid, so the time should be similar with different SIZE of grid. Should the float computation always fastest than double? I would like to know if is a gcc problem (I don't have other compiler) and if it is not what could be the problem? Hope to receive an answer as soon as possible, Thanks Michele Guidolin. P.S. Here are some more information about the test: The code that I'm testing is the follow and it is the same for the double version (the constant are not 0.25f but 0.25). ------------- CODE ------------- #define SHIFT_S 9 #define SIZE (1SHIFT_S) #define DUMP 0 #define MAT(i,j) ((i)SHIFT_S) + (j) inline void gs_relax(int i,int j,float *u, float *rhs) { u[MAT(i,j)] = (float)( rhs[MAT(i,j)] + 0.0f * u[MAT(i,j)] + 0.25f* u[MAT(i+1,j)]+ 0.25f* u[MAT(i-1,j)]+ 0.25f* u[MAT(i,j+1)]+ 0.25f* u[MAT(i,j-1)]); } look at the assembly code and see if the compiler is converting float to double in the above code. could be that the doubles are being loaded directly into the floating processor stack and the singles are being converted in a gp register then loaded into the fp stack. Recompile with double then look at the assembly code difference. |
#5
Beemer Biker wrote:
> look at the assembly code and see whether the compiler is converting
> float to double in the above code. [snip]

He's using SSE2. Check out his compiler flags.

Yousuf Khan
#6
Beemer Biker wrote:
> look at the assembly code and see whether the compiler is converting
> float to double in the above code. [snip] Recompile with double, then
> look at the difference in the assembly code.

Here is the assembler code of the float version:

------------- ASM -----------------
inline void gs_relax(int i,int j,float *u, float *rhs)
{
  fb:   55                      push   %ebp
        u[MAT(i,j)] = (float)( rhs[MAT(i,j)] +
  fc:   d9 ee                   fldz
  fe:   d9 05 00 00 00 00       flds   0x0
 104:   d9 c9                   fxch   %st(1)
 106:   89 e5                   mov    %esp,%ebp
 108:   56                      push   %esi
 109:   8b 45 08                mov    0x8(%ebp),%eax
 10c:   8b 75 0c                mov    0xc(%ebp),%esi
 10f:   53                      push   %ebx
 110:   c1 e0 09                shl    $0x9,%eax
 113:   8b 4d 10                mov    0x10(%ebp),%ecx
 116:   8b 55 14                mov    0x14(%ebp),%edx
 119:   8d 1c 30                lea    (%eax,%esi,1),%ebx
 11c:   c1 e0 00                shl    $0x0,%eax
 11f:   d8 0c 99                fmuls  (%ecx,%ebx,4)
 122:   d8 04 9a                fadds  (%edx,%ebx,4)
 125:   8d 94 30 00 02 00 00    lea    0x200(%eax,%esi,1),%edx
 12c:   c1 e0 00                shl    $0x0,%eax
 12f:   d9 04 91                flds   (%ecx,%edx,4)
 132:   8d 84 30 00 fe ff ff    lea    0xfffffe00(%eax,%esi,1),%eax
 139:   d8 ca                   fmul   %st(2),%st
 13b:   de c1                   faddp  %st,%st(1)
 13d:   d9 04 81                flds   (%ecx,%eax,4)
 140:   d8 ca                   fmul   %st(2),%st
 142:   de c1                   faddp  %st,%st(1)
 144:   d9 44 99 04             flds   0x4(%ecx,%ebx,4)
 148:   d8 ca                   fmul   %st(2),%st
 14a:   d9 ca                   fxch   %st(2)
 14c:   d8 4c 99 fc             fmuls  0xfffffffc(%ecx,%ebx,4)
 150:   d9 c9                   fxch   %st(1)
 152:   de c2                   faddp  %st,%st(2)
 154:   de c1                   faddp  %st,%st(1)
 156:   d9 1c 99                fstps  (%ecx,%ebx,4)
 159:   5b                      pop    %ebx
 15a:   5e                      pop    %esi
 15b:   5d                      pop    %ebp
 15c:   c3                      ret
------------- ASM -----------------

and here is the assembler code of the double version:

------------- ASM -----------------
inline void gs_relax(int i,int j,double *u, double *rhs)
{
 112:   55                      push   %ebp
        u[MAT(i,j)] = ( rhs[MAT(i,j)] +
 113:   d9 ee                   fldz
 115:   d9 05 00 00 00 00       flds   0x0
 11b:   d9 c9                   fxch   %st(1)
 11d:   89 e5                   mov    %esp,%ebp
 11f:   56                      push   %esi
 120:   8b 45 08                mov    0x8(%ebp),%eax
 123:   8b 75 0c                mov    0xc(%ebp),%esi
 126:   53                      push   %ebx
 127:   c1 e0 09                shl    $0x9,%eax
 12a:   8b 4d 10                mov    0x10(%ebp),%ecx
 12d:   8b 55 14                mov    0x14(%ebp),%edx
 130:   8d 1c 30                lea    (%eax,%esi,1),%ebx
 133:   c1 e0 00                shl    $0x0,%eax
 136:   dc 0c d9                fmull  (%ecx,%ebx,8)
 139:   dc 04 da                faddl  (%edx,%ebx,8)
 13c:   8d 94 30 00 02 00 00    lea    0x200(%eax,%esi,1),%edx
 143:   c1 e0 00                shl    $0x0,%eax
 146:   dd 04 d1                fldl   (%ecx,%edx,8)
 149:   8d 84 30 00 fe ff ff    lea    0xfffffe00(%eax,%esi,1),%eax
 150:   d8 ca                   fmul   %st(2),%st
 152:   de c1                   faddp  %st,%st(1)
 154:   dd 04 c1                fldl   (%ecx,%eax,8)
 157:   d8 ca                   fmul   %st(2),%st
 159:   de c1                   faddp  %st,%st(1)
 15b:   dd 44 d9 08             fldl   0x8(%ecx,%ebx,8)
 15f:   d8 ca                   fmul   %st(2),%st
 161:   d9 ca                   fxch   %st(2)
 163:   dc 4c d9 f8             fmull  0xfffffff8(%ecx,%ebx,8)
 167:   d9 c9                   fxch   %st(1)
 169:   de c2                   faddp  %st,%st(2)
 16b:   de c1                   faddp  %st,%st(1)
 16d:   dd 1c d9                fstpl  (%ecx,%ebx,8)
 170:   5b                      pop    %ebx
 171:   5e                      pop    %esi
 172:   5d                      pop    %ebp
 173:   c3                      ret
------------- ASM -----------------

It's been a long time since I last looked at assembler code, but to me it looks like the float version is doing all of the operations in float. Maybe Yousuf is right and it's the P4 that behaves badly here. Can somebody who knows the asm better than I do help me?

Thanks.

Michele.