#21
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Now for a little example.
Let's assume 3 blocks, each of 10 elements, and a loop count of 4.

The memory is initialized by code which is not provided, but it could look like this (these numbers are all 32 bit integers):

memory indexes:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

memory contents:
01 07 03 04 08 09 00 05 02 06
09 04 05 02 01 00 06 07 03 08
04 09 01 00 02 05 03 07 06 08

BlockBaseA = 00
BlockBaseB = 10
BlockBaseC = 20

This means that each line of memory contents is a block.

The numbers/indexes all point towards each other/indexes like so:

BlockA: 01-07-03-04-08-09-00-05-02-06
BlockB: 09-04-05-02-01-00-06-07-03-08
BlockC: 04-09-01-00-02-05-03-07-06-08

ElementIndex for A starts at index 0
ElementIndex for B starts at index 0
ElementIndex for C starts at index 0

So

01
09
04

are the first 3 values retrieved for A, B, C.

01 indicates the next index is located at index 01
09 indicates the next index is located at index 09
04 indicates the next index is located at index 04

So performing:

Memory[ BlockBaseA + 01] leads to 07
Memory[ BlockBaseB + 09] leads to 08
Memory[ BlockBaseC + 01] leads to 09

Next loop:

Memory[ BlockBaseA + 07] leads to 05
Memory[ BlockBaseB + 08] leads to 03
Memory[ BlockBaseC + 09] leads to 08

Next loop:

Memory[ BlockBaseA + 05] leads to 09
Memory[ BlockBaseB + 03] leads to 02
Memory[ BlockBaseC + 08] leads to 06

Next loop:

Memory[ BlockBaseA + 09] leads to 06
Memory[ BlockBaseB + 02] leads to 05
Memory[ BlockBaseC + 06] leads to 03

Done. 4 loops complete.

06
05
03

are stored in block result.

Bye,
Skybuck.
#22
I just wrote a redcode program for fun and it helped me spot a bug in my example (I shall post it soon; fortunately it worked with a little relative addressing adjustment!).

So I am going to correct it here, see *

"Skybuck Flying" wrote in message b.home.nl...

Now for a little example.

Let's assume 3 blocks, each of 10 elements, and a loop count of 4.

The memory is initialized by code which is not provided, but it could look like this (these numbers are all 32 bit integers):

memory indexes:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

memory contents:
01 07 03 04 08 09 00 05 02 06
09 04 05 02 01 00 06 07 03 08
04 09 01 00 02 05 03 07 06 08

BlockBaseA = 00
BlockBaseB = 10
BlockBaseC = 20

This means that each line of memory contents is a block.

The numbers/indexes all point towards each other/indexes like so:

BlockA: 01-07-03-04-08-09-00-05-02-06
BlockB: 09-04-05-02-01-00-06-07-03-08
BlockC: 04-09-01-00-02-05-03-07-06-08

ElementIndex for A starts at index 0
ElementIndex for B starts at index 0
ElementIndex for C starts at index 0

So

01
09
04

are the first 3 values retrieved for A, B, C.

01 indicates the next index is located at index 01
09 indicates the next index is located at index 09
04 indicates the next index is located at index 04

So performing:

Memory[ BlockBaseA + 01] leads to 07
Memory[ BlockBaseB + 09] leads to 08
// * the next line was wrong, so the later lines for C were wrong too.
// wrong:   Memory[ BlockBaseC + 01] leads to 09
// correct: Memory[ BlockBaseC + 04] leads to 02

Next loop:

Memory[ BlockBaseA + 07] leads to 05
Memory[ BlockBaseB + 08] leads to 03
// wrong:   Memory[ BlockBaseC + 09] leads to 08
// correct: Memory[ BlockBaseC + 02] leads to 01

Next loop:

Memory[ BlockBaseA + 05] leads to 09
Memory[ BlockBaseB + 03] leads to 02
// wrong:   Memory[ BlockBaseC + 08] leads to 06
// correct: Memory[ BlockBaseC + 01] leads to 09

Next loop:

Memory[ BlockBaseA + 09] leads to 06
Memory[ BlockBaseB + 02] leads to 05
// wrong:   Memory[ BlockBaseC + 06] leads to 03
// correct: Memory[ BlockBaseC + 09] leads to 08

Done. 4 loops complete.

06
05
// wrong:   03
// correct: 08

are stored in block result.

Bye,
Skybuck.
#23
The redcode program helped me spot one more little error.

The first loop was actually missing, so there will be 5 loops in the example. I am now going to make the example fully correct (loop count is now 5):

Now for a little example.

Let's assume 3 blocks, each of 10 elements, and a loop count of 5.

The memory is initialized by code which is not provided, but it could look like this (these numbers are all 32 bit integers):

memory indexes:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

memory contents:
01 07 03 04 08 09 00 05 02 06
09 04 05 02 01 00 06 07 03 08
04 09 01 00 02 05 03 07 06 08

BlockBaseA = 00
BlockBaseB = 10
BlockBaseC = 20

This means that each line of memory contents is a block.

The numbers/indexes all point towards each other/indexes like so:

BlockA: 01-07-03-04-08-09-00-05-02-06
BlockB: 09-04-05-02-01-00-06-07-03-08
BlockC: 04-09-01-00-02-05-03-07-06-08

ElementIndex for A starts at index 0
ElementIndex for B starts at index 0
ElementIndex for C starts at index 0

So

01
09
04

are the first 3 values retrieved for A, B, C.

01 indicates the next index is located at index 01
09 indicates the next index is located at index 09
04 indicates the next index is located at index 04

So performing:

Loop 0:

Memory[ BlockBaseA + 0] leads to 01
Memory[ BlockBaseB + 0] leads to 09
Memory[ BlockBaseC + 0] leads to 04

Next loop 1:

Memory[ BlockBaseA + 01] leads to 07
Memory[ BlockBaseB + 09] leads to 08
Memory[ BlockBaseC + 04] leads to 02

Next loop 2:

Memory[ BlockBaseA + 07] leads to 05
Memory[ BlockBaseB + 08] leads to 03
Memory[ BlockBaseC + 02] leads to 01

Next loop 3:

Memory[ BlockBaseA + 05] leads to 09
Memory[ BlockBaseB + 03] leads to 02
Memory[ BlockBaseC + 01] leads to 09

Next loop 4:

Memory[ BlockBaseA + 09] leads to 06
Memory[ BlockBaseB + 02] leads to 05
Memory[ BlockBaseC + 09] leads to 08

Done. 5 loops complete.

06
05
08

are stored in block result.

In a next posting I will post the little redcode program. It's kinda funny and might help conceive an assembler program.

Bye,
Skybuck.
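The five-loop walk-through can be checked mechanically. The following is a minimal C sketch, not the thread's actual test program: the names `example_memory` and `chase_block` are made up for illustration, and the array simply holds the 30 values listed in the example.

```c
#include <stdint.h>

/* The 30-element memory from the example: three blocks of ten indices.
   (Hypothetical name; the values are the ones listed in the post.) */
static const uint32_t example_memory[30] = {
    1, 7, 3, 4, 8, 9, 0, 5, 2, 6,   /* block A, base 0  */
    9, 4, 5, 2, 1, 0, 6, 7, 3, 8,   /* block B, base 10 */
    4, 9, 1, 0, 2, 5, 3, 7, 6, 8    /* block C, base 20 */
};

/* Follow the index chain inside one block, starting at element 0.
   Each value read becomes the element index used by the next read.
   Assumes every stored value is a valid index within the block. */
uint32_t chase_block(const uint32_t *memory, uint32_t block_base,
                     uint32_t loop_count)
{
    uint32_t element_index = 0;
    for (uint32_t loop = 0; loop < loop_count; ++loop) {
        element_index = memory[block_base + element_index];
    }
    return element_index;
}
```

Calling `chase_block(example_memory, 0, 5)`, `chase_block(example_memory, 10, 5)` and `chase_block(example_memory, 20, 5)` follows the same chains as loops 0 to 4 above, one block at a time.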
#24
The redcode program which executes the example:

Perhaps this simple redcode program (which uses a pseudo/virtual asm instruction set) might help at conceiving an x86 assembly program:

;redcode
;name MemoryTest
;author Skybuck Flying
;strategy MemoryTest
;version 001
;date 2 august 2011
;
org Start

Memory
dat 01
dat 07
dat 03
dat 04
dat 08
dat 09
dat 00
dat 05
dat 02
dat 06
dat 09
dat 04
dat 05
dat 02
dat 01
dat 00
dat 06
dat 07
dat 03
dat 08
dat 04
dat 09
dat 01
dat 00
dat 02
dat 05
dat 03
dat 07
dat 06
dat 08

BaseA dat 00
BaseB dat 10
BaseC dat 20

IndexA dat 0
IndexB dat 0
IndexC dat 0

; compensate for relative addressing, store relative address for memory.
LocationA dat Memory, 0
LocationB dat Memory, 0
LocationC dat Memory, 0

Start
FirstLoop

; warning: redcode's mov works opposite of intel x86's mov, redcode's mov is: source, dest

; copy memory location to location
mov.ab LocationA, LocationA
mov.ab LocationB, LocationB
mov.ab LocationC, LocationC

; add base to location
add.b BaseA, LocationA
add.b BaseB, LocationB
add.b BaseC, LocationC

; add index to location
add.b IndexA, LocationA
add.b IndexB, LocationB
add.b IndexC, LocationC

; retrieve new index from location
mov.b @LocationA, IndexA
mov.b @LocationB, IndexB
mov.b @LocationC, IndexC

; reduce a counter, repeat 5 times then done.
djn FirstLoop, #5

; show final result by copying final indexes to block result variables.
mov.b IndexA, BlockResultA
mov.b IndexB, BlockResultB
mov.b IndexC, BlockResultC

BlockResultA dat 0
BlockResultB dat 0
BlockResultC dat 0

Bye,
Skybuck.
#26
Skybuck Flying wrote:

My initial guess at the generated assembler is: 32 bit mode simply does not have enough registers available to "pipeline" in parallel. Perhaps 64 bit mode has enough registers available, however instructions in 64 bit mode are twice as slow, but maybe it might still be faster than 32 bit mode.

"
64 bit instructions have the same latencies as the corresponding 32 bit instructions (except 64 bit MUL and DIV). Using EBP as a general purpose register (GP) frees one register at no cost (except you -have to- use MOV instead of PUSH and POP).
"

Not on my AMD X2 3800+ processor. It's a fake 64 bit processor.

If it is an Athlon (the X2 probably means two cores?), it should be 64 bit. Depending on family and stepping, more (or less) XMM capabilities are provided.

I think it executes 64 bit instructions as two 32 bit instructions or worse. So the 64 bit instructions have clock cycles/latency of at least 2.

If I look at the code you posted, it cannot run faster than the 32 bit version.

I am not sure about newer processors but I would expect them to be faster, but I do also expect general suckage.

My 64 bit Windows 7 is fully usable in less than 30 s, regardless of whether it was powered down or rebooted.

So it could be interesting to turn the Delphi code into C/C++ code and then try it on a 64 bit compiler. However I am not really interested in C/C++ code because it's a hell of a lot of work to convert all Delphi code to C/C++ code, so I am not going to do that.

"
It doesn't really matter which high level language - Pascal and C(++) compilers generate comparable code, I guess. No automaton can replace a brain. Assembler is the only choice for effective optimisation.
"

Mhoaw, I am not so sure about that... when it comes to register re-use the compiler might spot something more easily than a human being in all that code.

A look at the posted code tells us the opposite...

I would expect a C/C++ compiler to be slightly faster, especially the one from Microsoft, especially in "release mode".

Also Microsoft has had a 64 bit compiler for a while now...

I use AS (part of the minGW-64 compiler suite).

But perhaps soon a 64 bit Delphi compiler will be out... I think there is already a preview compiler somewhere. This also leaves the Free Pascal compiler as a possible try, which has 64 bit compiling as well. Last time I tried it, it wasn't so great, but maybe it has improved, but I wouldn't hold my breath. None the less it's interesting to simply do a Free Pascal compile for 64 bit mode. It shouldn't be that hard to do I guess, so I am gonna give that a try and then see if I can get at the assembler to see what it generates in 64 bit mode.

"
Probably the same with fewer workarounds. 64 bit code gains most of its speed due to parameter passing via registers. Other optimisations and improvements were possible, but: the 64 bit code I have seen until now still looked like its older 32 bit brethren.
"

Yeah I have heard the same thing, the extra registers give it some more speed.

"
Having a short look at your 32 and 64 bit sources: I have to translate them into something human readable before I can start to figure out what they actually do. This might take a while (I am working from 06:00 until everything is done, leaving not much more than one or two hours for anything else), but I'm sure it is possible to make those loops (at least!) twice as fast with some better code. I'll post some code 'til Saturday or Sunday.
"

Well I am interested, but I doubt you can do it =D But maybe I underestimate you ! =D

Maybe...

snip

"
Do you really need a result in seconds? Recent CPUs, be it AMD or LETNi, change the frequency of a single core if required. Busy cores run at a higher frequency while the frequency of idle cores is slowed down for that time. I doubt the returned value is accurate in this case. On my Phenom II 1100T, the frequency can vary between a few hundred and 3700 MHz (no overclocking) per core, while the processor speed is 3300 MHz (and this probably is the "frequency" reported by the API function "QueryPerformanceFrequency").
"

I think it's accurate enough, the timing could be verified with other timers just in case.

Read the description at MS knowledgebase. My guess hit the nail's head.

One more question: Which parameters do you pass to the function? How does the code get the address of the memory block it should process?

First improvement: This sequence

    movsxd rax,r14d
    shl rax,2
    mov dword ptr [rdx+rax],r9d

is equal to the single instruction

    mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous- clock cycles per iteration. 6 * 80,000 = 480,000 saved cycles.

Do you need positive and negative indices? If not, six more clocks per iteration can be saved. And this is no optimization - it's just a correction of flaws...

Greetings from Augsburg

Bernhard Schornak
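The "first improvement" above rests on the fact that an explicitly computed byte offset and a scaled array index reach the same memory location. A minimal C sketch of that equivalence follows; the function names are hypothetical, and whether a given compiler actually emits the single scaled-index store for the second form depends on the compiler and its settings.

```c
#include <stdint.h>

/* Store through an explicitly computed byte offset: widen the index to
   64 bits, scale it by the element size (4 bytes), add it to the base.
   This mirrors the three-instruction sequence quoted above. */
void store_by_byte_offset(uint32_t *base, int32_t index, uint32_t value)
{
    int64_t byte_offset = (int64_t)index * 4;      /* widen, then scale */
    *(uint32_t *)((char *)base + byte_offset) = value;
}

/* Store through ordinary array indexing with an index that is already
   64 bits wide. The scaling by 4 is implied by the element type, which
   is what the scaled-index addressing mode [base + index*4] expresses. */
void store_by_index(uint32_t *base, int64_t index, uint32_t value)
{
    base[index] = value;
}
```

Both functions write the same element for the same index; the difference is only in how much address arithmetic is spelled out as separate steps.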
#27
"
One more question: Which parameters do you pass to the function? How does the code get the address of the memory block it should process?
"

No parameters are passed. The routine is part of an "object", so it gets the address of fields which are inside the object.

If this is a problem for you then you can change the routine so it accepts parameters.

For example:

procedure MyRoutine( Memory : pointer );

This would pass the memory pointer in eax if I remember correctly.

"
First improvement: This sequence

    movsxd rax,r14d
    shl rax,2
    mov dword ptr [rdx+rax],r9d

is equal to the single instruction

    mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous- clock cycles per iteration. 6 * 80,000 = 480,000 saved cycles.
"

Hmm, I'll have to check this out some more... for which of my code versions was this?

"
Do you need positive and negative indices? If not, six more clocks per iteration can be saved. And this is no optimization - it's just a correction of flaws...
"

Only positive in the current example, probably in the real world problem too. I could try changing the type from integer (signed) to longword (unsigned/positive only) to see if that helps the compiler.

Bye,
Skybuck.
#28
Skybuck Flying wrote:

"
One more question: Which parameters do you pass to the function? How does the code get the address of the memory block it should process?
"

No parameters are passed. The routine is part of an "object". So it gets the address of fields which are inside the object. If this is a problem for you then you can change the routine so it accepts parameters.

It is not, but your 64 bit code retrieves one parameter (passed in RCX). It looks like it is an address, 'cause it is used to access memory locations relative to RCX:

    sub rsp,264
    mov qword ptr [rsp+120],rbx
    mov qword ptr [rsp+128],rdi
    mov qword ptr [rsp+136],rsi
    mov qword ptr [rsp+144],r12
    mov qword ptr [rsp+152],r13
    mov qword ptr [rsp+160],r14
    mov qword ptr [rsp+168],r15
    mov qword ptr [rsp+112],rcx     ; RCX is stored at 112[RSP]
    lea rcx,qword ptr [rsp+64]
    call QueryPerformanceCounter
    mov rax,qword ptr [rsp+112]     ; RAX is loaded with the stored content of RCX
    mov eax,dword ptr [rax+24]      ; memory at 24[RAX] is accessed

If RCX didn't hold an address, the last line would definitely crash sooner or later. Probably, your compiler passes the base address of that array behind your back...

For example: procedure MyRoutine( Memory : pointer ); This would pass the memory pointer in eax if I remember correctly

No. In 32 bit code, parameters are passed on the stack. In 64 bit Windows, the first four parameters are passed in RCX, RDX, R8 and R9, or in XMM0...XMM3 for FP values. Remaining parameters are passed on the stack at 0x20[RSP] and up. The area 0x00 ... 0x20[RSP] is called "shadow space" (or home space). It is reserved for the called function.

"
First improvement: This sequence

    movsxd rax,r14d
    shl rax,2
    mov dword ptr [rdx+rax],r9d

is equal to the single instruction

    mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous- clock cycles per iteration. 6 * 80,000 = 480,000 saved cycles.
"

Hmm I'll have to check this out some more... for which of my code version was this ?

R14 is a 64 bit register (only available in long mode).

"
Do you need positive and negative indices? If not, six more clocks per iteration can be saved. And this is no optimization - it's just a correction of flaws...
"

Only positive in the current example, probably real world problem too. I could try changing the type from integer (signed) to longword(unsigned/positive only) to see if that helps the compiler.

It surely avoids many "workarounds" like this one

    movsxd rax,r14d
    add rax,3
    mov r14d,eax

which can be reduced to

    add r14d,3

One instead of three clock cycles in time critical code is an improvement.

I'm going to translate the 64 bit code tomorrow evening (partially done, but unfinished). I love weekends...

Greetings from Augsburg

Bernhard Schornak
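The signed versus unsigned question comes down to how a 32 bit index is widened to 64 bits before it is used in an address. A small C sketch, with hypothetical function names, shows the two conversions; on x86-64 a write to a 32 bit register clears the upper half of the full register, so the unsigned case typically needs no separate instruction, while the signed case needs an explicit sign extension such as movsxd.

```c
#include <stdint.h>

/* Signed widening: the sign bit is copied into the upper 32 bits,
   so negative indices stay negative (sign extension). */
int64_t widen_signed(int32_t index)
{
    return (int64_t)index;
}

/* Unsigned widening: the upper 32 bits are simply zero
   (zero extension). */
uint64_t widen_unsigned(uint32_t index)
{
    return (uint64_t)index;
}
```

For non-negative indices both conversions give the same value, which is why switching the index type to an unsigned one does not change the results of the example.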
#29
Here is the new Delphi/Pascal code. It uses longwords, which are 32 bit unsigned integers.

Some code has been added to process the remaining blocks, if any.

It's indeed now about as fast as the 32 bit version.

However it would be more interesting to test what happens when 64 bit elements are used. So I will make another version later on, and it would be more interesting if you could optimize that one instead of the longword version.

But for now here is the longword version and its assembly listing:

// *** Begin of Delphi/Pascal 32 bit code ***

// paired version
// version 0.03, optimized inner loop and easy to use local variables/registers.
// version 0.04, timing code moved outside of routine, types changed to longword.
// this code assumes mBlockCount is at least 4.
// this code assumes vLoopCount is at least 1.
// code further corrected so remaining blocks are processed as well.
procedure TCPUMemoryTest.ExecuteCPU;
var
    vLoopIndex : longword;

    vBlockIndexA : longword;
    vBlockIndexB : longword;
    vBlockIndexC : longword;

    vElementIndexA : longword;
    vElementIndexB : longword;
    vElementIndexC : longword;

    vElementCount : longword;
    vBlockCount : longword;
    vLoopCount : longword;

    vBlockBaseA : longword;
    vBlockBaseB : longword;
    vBlockBaseC : longword;
begin
    vElementCount := mElementCount;
    vBlockCount := mBlockCount;
    vLoopCount := mLoopCount;

    vBlockIndexA := 0;
    vBlockIndexB := 1;
    vBlockIndexC := 2;
    while vBlockIndexA <= (vBlockCount-4) do
    begin
        vBlockBaseA := vBlockIndexA * vElementCount;
        vBlockBaseB := vBlockIndexB * vElementCount;
        vBlockBaseC := vBlockIndexC * vElementCount;

        vElementIndexA := 0;
        vElementIndexB := 0;
        vElementIndexC := 0;

        for vLoopIndex := 0 to vLoopCount-1 do
        begin
            vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
            vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
            vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
        end;

        mBlockResult[ vBlockIndexA ] := vElementIndexA;
        mBlockResult[ vBlockIndexB ] := vElementIndexB;
        mBlockResult[ vBlockIndexC ] := vElementIndexC;

        vBlockIndexA := vBlockIndexA + 3;
        vBlockIndexB := vBlockIndexB + 3;
        vBlockIndexC := vBlockIndexC + 3;
    end;

    while vBlockIndexA <= (vBlockCount-1) do
    begin
        vBlockBaseA := vBlockIndexA * vElementCount;

        vElementIndexA := 0;

        for vLoopIndex := 0 to vLoopCount-1 do
        begin
            vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
        end;

        mBlockResult[ vBlockIndexA ] := vElementIndexA;

        vBlockIndexA := vBlockIndexA + 1;
    end;
end;

// *** End of Delphi/Pascal 32 bit code ***

// *** Begin of Free Pascal 64 bit output for 32 bit example ***

_TEXT SEGMENT
ALIGN 16
PUBLIC UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__EXECUTECPU
UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__EXECUTECPU:
; Temps allocated between rsp+32 and rsp+120
; [360] begin
    sub rsp,168
; Var $self located in register r15
; Var vLoopIndex located in register eax
; Var vBlockIndexA located in register r14d
; Var vBlockIndexB located in register ecx
; Var vBlockIndexC located in register ebx
; Var vElementIndexA located in register r9d
; Var vElementIndexB located in register r10d
; Var vElementIndexC located in register r11d
; Var vElementCount located in register eax
; Var vBlockCount located in register eax
; Var vLoopCount located in register eax
; Var vBlockBaseA located in register esi
; Var vBlockBaseB located in register edi
; Var vBlockBaseC located in register r8d
    mov qword ptr [rsp+64],rbx
    mov qword ptr [rsp+72],rdi
    mov qword ptr [rsp+80],rsi
    mov qword ptr [rsp+88],r12
    mov qword ptr [rsp+96],r13
    mov qword ptr [rsp+104],r14
    mov qword ptr [rsp+112],r15
    mov r15,rcx
; [361] vElementCount := mElementCount;
    mov eax,dword ptr [r15+24]
    mov qword ptr [rsp+40],rax
; [362] vBlockCount := mBlockCount;
    mov eax,dword ptr [r15+28]
    mov qword ptr [rsp+56],rax
; [363] vLoopCount := mLoopCount;
    mov eax,dword ptr [r15+32]
    mov qword ptr [rsp+32],rax
; [365] vBlockIndexA := 0;
    mov r14d,0
; [366] vBlockIndexB := 1;
    mov ecx,1
; [367] vBlockIndexC := 2;
    mov ebx,2
; [368] while vBlockIndexA <= (vBlockCount-4) do
    jmp @@j146
    ALIGN 8
@@j145:
; [370] vBlockBaseA := vBlockIndexA * vElementCount;
    mov r12d,r14d
    and r12d,-1
    mov edx,dword ptr [rsp+40]
    mov eax,edx
    and eax,-1
    mul r12
    mov esi,eax
; [371] vBlockBaseB := vBlockIndexB * vElementCount;
    mov edx,ecx
    and edx,-1
    mov r12d,dword ptr [rsp+40]
    mov eax,r12d
    and eax,-1
    mul rdx
    mov edi,eax
; [372] vBlockBaseC := vBlockIndexC * vElementCount;
    mov edx,ebx
    and edx,-1
    mov r12d,dword ptr [rsp+40]
    mov eax,r12d
    and eax,-1
    mul rdx
    mov r8d,eax
; [374] vElementIndexA := 0;
    mov r9d,0
; [375] vElementIndexB := 0;
    mov r10d,0
; [376] vElementIndexC := 0;
    mov r11d,0
; [378] for vLoopIndex := 0 to vLoopCount-1 do
    mov edx,dword ptr [rsp+32]
    mov eax,edx
    and eax,-1
    dec rax
    mov r12d,eax
    mov eax,0
    mov qword ptr [rsp+48],rax
    mov eax,dword ptr [rsp+48]
    cmp r12d,eax
    jb @@j161
    mov eax,dword ptr [rsp+48]
    dec eax
    mov qword ptr [rsp+48],rax
    ALIGN 8
@@j162:
    mov eax,dword ptr [rsp+48]
    inc eax
    mov qword ptr [rsp+48],rax
; [380] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    mov r13,qword ptr [r15+8]
    mov eax,esi
    and eax,-1
    mov edx,r9d
    and edx,-1
    add rax,rdx
    shl rax,2
    mov r9d,dword ptr [r13+rax]
; [381] vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
    mov rax,qword ptr [r15+8]
    mov edx,edi
    and edx,-1
    mov r13d,r10d
    and r13d,-1
    add rdx,r13
    shl rdx,2
    mov r10d,dword ptr [rax+rdx]
; [382] vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
    mov rax,qword ptr [r15+8]
    mov edx,r8d
    and edx,-1
    mov r13d,r11d
    and r13d,-1
    add rdx,r13
    shl rdx,2
    mov r11d,dword ptr [rax+rdx]
    mov eax,dword ptr [rsp+48]
    cmp r12d,eax
    ja @@j162
@@j161:
; [385] mBlockResult[ vBlockIndexA ] := vElementIndexA;
    mov rdx,qword ptr [r15+16]
    mov eax,r14d
    and eax,-1
    shl rax,2
    mov dword ptr [rdx+rax],r9d
; [386] mBlockResult[ vBlockIndexB ] := vElementIndexB;
    mov rdx,qword ptr [r15+16]
    mov eax,ecx
    and eax,-1
    shl rax,2
    mov dword ptr [rdx+rax],r10d
; [387] mBlockResult[ vBlockIndexC ] := vElementIndexC;
    mov rax,qword ptr [r15+16]
    mov edx,ebx
    and edx,-1
    shl rdx,2
    mov dword ptr [rax+rdx],r11d
; [389] vBlockIndexA := vBlockIndexA + 3;
    mov eax,r14d
    and eax,-1
    add rax,3
    mov r14d,eax
; [390] vBlockIndexB := vBlockIndexB + 3;
    mov eax,ecx
    and eax,-1
    add rax,3
    mov ecx,eax
; [391] vBlockIndexC := vBlockIndexC + 3;
    mov eax,ebx
    and eax,-1
    add rax,3
    mov ebx,eax
@@j146:
    mov eax,dword ptr [rsp+56]
    mov edx,eax
    and edx,-1
    sub rdx,4
    mov eax,r14d
    and eax,-1
    cmp rdx,rax
    jge @@j145
@@j147:
; [394] while vBlockIndexA <= (vBlockCount-1) do
    jmp @@j182
    ALIGN 8
@@j181:
; [396] vBlockBaseA := vBlockIndexA * vElementCount;
    mov ecx,r14d
    and ecx,-1
    mov edx,dword ptr [rsp+40]
    mov eax,edx
    and eax,-1
    mul rcx
    mov esi,eax
; [398] vElementIndexA := 0;
    mov r9d,0
; [400] for vLoopIndex := 0 to vLoopCount-1 do
    mov eax,dword ptr [rsp+32]
    mov edx,eax
    and edx,-1
    dec rdx
    mov eax,0
    mov qword ptr [rsp+48],rax
    mov eax,dword ptr [rsp+48]
    cmp edx,eax
    jb @@j189
    mov eax,dword ptr [rsp+48]
    dec eax
    mov qword ptr [rsp+48],rax
    ALIGN 8
@@j190:
    mov eax,dword ptr [rsp+48]
    inc eax
    mov qword ptr [rsp+48],rax
; [402] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    mov rbx,qword ptr [r15+8]
    mov ecx,esi
    and ecx,-1
    mov eax,r9d
    and eax,-1
    add rcx,rax
    shl rcx,2
    mov r9d,dword ptr [rbx+rcx]
    mov eax,dword ptr [rsp+48]
    cmp edx,eax
    ja @@j190
@@j189:
; [405] mBlockResult[ vBlockIndexA ] := vElementIndexA;
    mov rdx,qword ptr [r15+16]
    mov eax,r14d
    and eax,-1
    shl rax,2
    mov dword ptr [rdx+rax],r9d
; [407] vBlockIndexA := vBlockIndexA + 1;
    mov eax,r14d
    and eax,-1
    inc rax
    mov r14d,eax
@@j182:
    mov eax,dword ptr [rsp+56]
    mov edx,eax
    and edx,-1
    dec rdx
    mov eax,r14d
    and eax,-1
    cmp rdx,rax
    jge @@j181
@@j183:
; [410] end;
    mov rbx,qword ptr [rsp+64]
    mov rdi,qword ptr [rsp+72]
    mov rsi,qword ptr [rsp+80]
    mov r12,qword ptr [rsp+88]
    mov r13,qword ptr [rsp+96]
    mov r14,qword ptr [rsp+104]
    mov r15,qword ptr [rsp+112]
    add rsp,168
    ret
_TEXT ENDS

// *** End of Free Pascal 64 bit output for 32 bit example ***

Bye,
Skybuck.
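For readers who want to try the routine outside Delphi, here is a rough C sketch of the same structure: a main loop that walks three blocks at a time so that three independent loads are available per iteration, followed by a remainder loop for leftover blocks. It is an illustration under assumptions, not a translation of the actual test program: the function name and parameter list are made up, the object fields are passed as plain arguments, and the main loop condition is written as `block + 2 < block_count` rather than with `block_count - 4`, which sidesteps unsigned underflow when there are fewer than four blocks.

```c
#include <stdint.h>

/* Walk the index chains of three blocks at a time, then finish any
   leftover blocks one by one. Assumes every stored value is a valid
   element index (less than element_count). */
void execute_cpu(const uint32_t *memory, uint32_t *block_result,
                 uint32_t element_count, uint32_t block_count,
                 uint32_t loop_count)
{
    uint32_t block = 0;

    /* Main loop: three independent chains per iteration. */
    while (block + 2 < block_count) {
        uint32_t base_a = (block + 0) * element_count;
        uint32_t base_b = (block + 1) * element_count;
        uint32_t base_c = (block + 2) * element_count;
        uint32_t a = 0, b = 0, c = 0;

        for (uint32_t loop = 0; loop < loop_count; ++loop) {
            a = memory[base_a + a];
            b = memory[base_b + b];
            c = memory[base_c + c];
        }

        block_result[block + 0] = a;
        block_result[block + 1] = b;
        block_result[block + 2] = c;
        block += 3;
    }

    /* Remainder loop: zero, one or two blocks are left. */
    while (block < block_count) {
        uint32_t base = block * element_count;
        uint32_t index = 0;

        for (uint32_t loop = 0; loop < loop_count; ++loop) {
            index = memory[base + index];
        }

        block_result[block] = index;
        block += 1;
    }
}
```

The three loads in the inner loop do not depend on each other, only on their own previous value, which is the property the paired version is meant to exploit.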
#30
Also I am not sure if the correct tools are being called. I am using Textpad 4 to invoke the Free Pascal Compiler.

After compiling the 64 bit sources it returns:

"
Assembling:
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04 compile with free pascal\unit_TCPUMemoryTest_version_001.s
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04 compile with free pascal\unit_TCPUMemoryTest_version_001.s(717) : error A2070:invalid instruction operands
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04 compile with free pascal\unit_TCPUMemoryTest_version_001.s(752) : error A2070:invalid instruction operands
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04 compile with free pascal\unit_TCPUMemoryTest_version_001.s(753) : error A2070:invalid instruction operands
Microsoft (R) Macro Assembler (x64) Version 10.00.30319.01
Copyright (C) Microsoft Corporation. All rights reserved.
unit_TCPUMemoryTest_version_001.pas(477) Error: Error while assembling exitcode 1
unit_TCPUMemoryTest_version_001.pas(477) Fatal: There were 2 errors compiling module, stopping
Fatal: Compilation aborted

Tool completed with exit code 1
"

^ Microsoft Macro Assembler ?

Anyway... I modified the sources so they now use 64 bit unsigned integers.

For some reason the code is now slow again, twice as slow as the 32 bit version. (Perhaps it's just a slow 64 bit processor, as I suspect. I benchmarked 64 bit integer operations some time ago. There should be a usenet posting of it somewhere in google newsgroups or so.)

Maybe the loops should use 32 bits instead of 64 bits ?

Anyway here is version 0.05, the 64 bit version:

// *** Begin of Delphi/Pascal 64 bit Code ***

// version 0.05, 64 bit version.
procedure TCPUMemoryTest.ExecuteCPU;
var
    vLoopIndex : uint64;

    vBlockIndexA : uint64;
    vBlockIndexB : uint64;
    vBlockIndexC : uint64;

    vElementIndexA : uint64;
    vElementIndexB : uint64;
    vElementIndexC : uint64;

    vElementCount : uint64;
    vBlockCount : uint64;
    vLoopCount : uint64;

    vBlockBaseA : uint64;
    vBlockBaseB : uint64;
    vBlockBaseC : uint64;
begin
    vElementCount := mElementCount;
    vBlockCount := mBlockCount;
    vLoopCount := mLoopCount;

    vBlockIndexA := 0;
    vBlockIndexB := 1;
    vBlockIndexC := 2;
    while vBlockIndexA <= (vBlockCount-4) do
    begin
        vBlockBaseA := vBlockIndexA * vElementCount;
        vBlockBaseB := vBlockIndexB * vElementCount;
        vBlockBaseC := vBlockIndexC * vElementCount;

        vElementIndexA := 0;
        vElementIndexB := 0;
        vElementIndexC := 0;

        for vLoopIndex := 0 to vLoopCount-1 do
        begin
            vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
            vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
            vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
        end;

        mBlockResult[ vBlockIndexA ] := vElementIndexA;
        mBlockResult[ vBlockIndexB ] := vElementIndexB;
        mBlockResult[ vBlockIndexC ] := vElementIndexC;

        vBlockIndexA := vBlockIndexA + 3;
        vBlockIndexB := vBlockIndexB + 3;
        vBlockIndexC := vBlockIndexC + 3;
    end;

    while vBlockIndexA <= (vBlockCount-1) do
    begin
        vBlockBaseA := vBlockIndexA * vElementCount;

        vElementIndexA := 0;

        for vLoopIndex := 0 to vLoopCount-1 do
        begin
            vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
        end;

        mBlockResult[ vBlockIndexA ] := vElementIndexA;

        vBlockIndexA := vBlockIndexA + 1;
    end;
end;

// *** End of Delphi/Pascal 64 bit Code ***

// *** Begin of Free Pascal 64 bit code for 64 bit example ***

_TEXT SEGMENT
ALIGN 16
PUBLIC UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__EXECUTECPU
UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__EXECUTECPU:
; Temps allocated between rsp+32 and rsp+112
; [390] begin
    sub rsp,152
; Var $self located in register r15
; Var vLoopIndex located in register rax
; Var vBlockIndexA located in register rcx
; Var vBlockIndexB located in register rbx
; Var vBlockIndexC located in register rsi
; Var vElementIndexA located in register r10
; Var vElementIndexB located in register r11
; Var vElementIndexC located in register r12
; Var vElementCount located in register rax
; Var vBlockCount located in register rax
; Var vLoopCount located in register r14
; Var vBlockBaseA located in register rdi
; Var vBlockBaseB located in register r8
; Var vBlockBaseC located in register r9
    mov qword ptr [rsp+56],rbx
    mov qword ptr [rsp+64],rdi
    mov qword ptr [rsp+72],rsi
    mov qword ptr [rsp+80],r12
    mov qword ptr [rsp+88],r13
    mov qword ptr [rsp+96],r14
    mov qword ptr [rsp+104],r15
    mov r15,rcx
; [391] vElementCount := mElementCount;
    mov rax,qword ptr [r15+24]
    mov qword ptr [rsp+32],rax
; [392] vBlockCount := mBlockCount;
    mov rax,qword ptr [r15+32]
    mov qword ptr [rsp+48],rax
; [393] vLoopCount := mLoopCount;
    mov r14,qword ptr [r15+40]
; [395] vBlockIndexA := 0;
    mov rcx,0
; [396] vBlockIndexB := 1;
    mov rbx,1
; [397] vBlockIndexC := 2;
    mov rsi,2
; [398] while vBlockIndexA <= (vBlockCount-4) do
    jmp @@j166
    ALIGN 8
@@j165:
; [400] vBlockBaseA := vBlockIndexA * vElementCount;
    mov rax,qword ptr [rsp+32]
    mul rcx
    mov rdi,rax
; [401] vBlockBaseB := vBlockIndexB * vElementCount;
    mov rax,qword ptr [rsp+32]
    mul rbx
    mov r8,rax
; [402] vBlockBaseC := vBlockIndexC * vElementCount;
    mov rax,qword ptr [rsp+32]
    mul rsi
    mov r9,rax
; [404] vElementIndexA := 0;
    mov r10,0
; [405] vElementIndexB := 0;
    mov r11,0
; [406] vElementIndexC := 0;
    mov r12,0
; [408] for vLoopIndex := 0 to vLoopCount-1 do
    mov rax,r14
    dec rax
    mov rdx,rax
    mov qword ptr [rsp+40],0
    cmp rdx,qword ptr [rsp+40]
    jb @@j181
    dec qword ptr [rsp+40]
    ALIGN 8
@@j182:
    inc qword ptr [rsp+40]
; [410] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    mov rax,qword ptr [r15+8]
    mov r13,r10
    add r13,rdi
    shl r13,3
    mov r10,qword ptr [rax+r13]
; [411] vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
    mov rax,qword ptr [r15+8]
    mov r13,r11
    add r13,r8
    shl r13,3
    mov r11,qword ptr [rax+r13]
; [412] vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
    mov rax,qword ptr [r15+8]
    mov r13,r12
    add r13,r9
    shl r13,3
    mov r12,qword ptr [rax+r13]
    cmp rdx,qword ptr [rsp+40]
    ja @@j182
@@j181:
; [415] mBlockResult[ vBlockIndexA ] := vElementIndexA;
    mov rdx,qword ptr [r15+16]
    mov rax,rcx
    shl rax,3
    mov qword ptr [rdx+rax],r10
; [416] mBlockResult[ vBlockIndexB ] := vElementIndexB;
    mov rax,qword ptr [r15+16]
    mov rdx,rbx
    shl rdx,3
    mov qword ptr [rax+rdx],r11
; [417] mBlockResult[ vBlockIndexC ] := vElementIndexC;
    mov rdx,qword ptr [r15+16]
    mov rax,rsi
    shl rax,3
    mov qword ptr [rdx+rax],r12
; [419] vBlockIndexA := vBlockIndexA + 3;
    mov rax,rcx
    add rax,3
    mov rcx,rax
; [420] vBlockIndexB := vBlockIndexB + 3;
    mov rax,rbx
    add rax,3
    mov rbx,rax
; [421] vBlockIndexC := vBlockIndexC + 3;
    mov rax,rsi
    add rax,3
    mov rsi,rax
@@j166:
    mov rax,qword ptr [rsp+48]
    sub rax,4
    cmp rax,rcx
    jae @@j165
@@j167:
; [424] while vBlockIndexA <= (vBlockCount-1) do
    jmp @@j202
    ALIGN 8
@@j201:
; [426] vBlockBaseA := vBlockIndexA * vElementCount;
    mov rax,qword ptr [rsp+32]
    mul rcx
    mov rdi,rax
; [428] vElementIndexA := 0;
    mov r10,0
; [430] for vLoopIndex := 0 to vLoopCount-1 do
    mov rax,r14
    dec rax
    mov qword ptr [rsp+40],0
    cmp rax,qword ptr [rsp+40]
    jb @@j209
    dec qword ptr [rsp+40]
    ALIGN 8
@@j210:
    inc qword ptr [rsp+40]
; [432] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    mov rbx,qword ptr [r15+8]
    mov rdx,r10
    add rdx,rdi
    shl rdx,3
    mov r10,qword ptr [rbx+rdx]
    cmp rax,qword ptr [rsp+40]
    ja @@j210
@@j209:
; [435] mBlockResult[ vBlockIndexA ] := vElementIndexA;
    mov rdx,qword ptr [r15+16]
    mov rax,rcx
    shl rax,3
    mov qword ptr [rdx+rax],r10
; [437] vBlockIndexA := vBlockIndexA + 1;
    mov rax,rcx
    inc rax
    mov rcx,rax
@@j202:
    mov rax,qword ptr [rsp+48]
    dec rax
    cmp rax,rcx
    jae @@j201
@@j203:
; [439] end;
    mov rbx,qword ptr [rsp+56]
    mov rdi,qword ptr [rsp+64]
    mov rsi,qword ptr [rsp+72]
    mov r12,qword ptr [rsp+80]
    mov r13,qword ptr [rsp+88]
    mov r14,qword ptr [rsp+96]
    mov r15,qword ptr [rsp+104]
    add rsp,152
    ret
_TEXT ENDS

// *** End of Free Pascal 64 bit code for 64 bit example ***

Bye,
Skybuck.
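One visible difference between the two listings is the index scaling: the 32 bit element version shifts indices left by 2 (4 bytes per element), the 64 bit element version shifts by 3 (8 bytes per element). The same block and element counts therefore span twice as many bytes. The C sketch below only does that arithmetic; the function name is hypothetical and the counts are whatever the caller supplies. Whether the larger footprint accounts for the slowdown reported above is not established here; it is one factor that could be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Total bytes occupied by block_count blocks of element_count elements,
   each element_size bytes wide. Overflow is not checked in this sketch. */
size_t footprint_bytes(size_t block_count, size_t element_count,
                       size_t element_size)
{
    return block_count * element_count * element_size;
}
```

For the small three-block, ten-element example, `footprint_bytes(3, 10, sizeof(uint32_t))` gives 120 bytes and `footprint_bytes(3, 10, sizeof(uint64_t))` gives 240 bytes.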