A computer components & hardware forum. HardwareBanter


An idea how to speed up computer programs and avoid waiting. ("event driven memory system")



 
 
  #21  
Old August 1st 11, 11:37 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]
external usenet poster
 
Posts: 460
Default An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

Now for a little example.

Let's assume 3 blocks, each of 10 elements, and a loop count of 4.

The memory is initialized by code which is not provided.

But it could look like this:

(These numbers are all 32 bit integers):

memory indexes:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

memory contents:
01 07 03 04 08 09 00 05 02 06
09 04 05 02 01 00 06 07 03 08
04 09 01 00 02 05 03 07 06 08

BlockBaseA = 00
BlockBaseB = 10
BlockBaseC = 20

This means that each line of memory contents is a block.

The numbers/indexes all point towards each other/indexes like so:

BlockA: 01-07-03-04-08-09-00-05-02-06
BlockB: 09-04-05-02-01-00-06-07-03-08
BlockC: 04-09-01-00-02-05-03-07-06-08

ElementIndex for A starts at index 0
ElementIndex for B starts at index 0
ElementIndex for C starts at index 0

So
01
09
04

are the first 3 values retrieved for A, B and C.

01 indicates the next index is located at index 01
09 indicates the next index is located at index 09
04 indicates the next index is located at index 04

So performing:

Memory[ BlockBaseA + 01] leads to 07
Memory[ BlockBaseB + 09] leads to 08
Memory[ BlockBaseC + 01] leads to 09

Next loop:

Memory[ BlockBaseA + 07] leads to 05
Memory[ BlockBaseB + 08] leads to 03
Memory[ BlockBaseC + 09] leads to 08

Next loop:
Memory[ BlockBaseA + 05] leads to 09
Memory[ BlockBaseB + 03] leads to 02
Memory[ BlockBaseC + 08] leads to 06

Next loop:
Memory[ BlockBaseA + 09] leads to 06
Memory[ BlockBaseB + 02] leads to 05
Memory[ BlockBaseC + 06] leads to 03

Done. 4 loops complete.

06
05
03

Are stored in block result.

Bye,
Skybuck.

  #22  
Old August 1st 11, 11:58 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

I just wrote a redcode program for fun, and it helped me spot a bug in my
example. (I shall post it soon; fortunately it worked after a little relative
addressing adjustment!)

So I am going to correct it here, see *

"Skybuck Flying" wrote in message
b.home.nl...

Now for a little example.

Let's assume 3 blocks, each of 10 elements, and a loop count of 4.

The memory is initialized by code which is not provided.

But it could look like this:

(These numbers are all 32 bit integers):

memory indexes:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

memory contents:
01 07 03 04 08 09 00 05 02 06
09 04 05 02 01 00 06 07 03 08
04 09 01 00 02 05 03 07 06 08

BlockBaseA = 00
BlockBaseB = 10
BlockBaseC = 20

This means that each line of memory contents is a block.

The numbers/indexes all point towards each other/indexes like so:

BlockA: 01-07-03-04-08-09-00-05-02-06
BlockB: 09-04-05-02-01-00-06-07-03-08
BlockC: 04-09-01-00-02-05-03-07-06-08

ElementIndex for A starts at index 0
ElementIndex for B starts at index 0
ElementIndex for C starts at index 0

So
01
09
04

are the first 3 values retrieved for A, B and C.

01 indicates the next index is located at index 01
09 indicates the next index is located at index 09
04 indicates the next index is located at index 04

So performing:

Memory[ BlockBaseA + 01] leads to 07
Memory[ BlockBaseB + 09] leads to 08
// * wrong, so everything that follows for C is wrong too.
// wrong: Memory[ BlockBaseC + 01] leads to 09
// correct: Memory[ BlockBaseC + 04] leads to 02

Next loop:

Memory[ BlockBaseA + 07] leads to 05
Memory[ BlockBaseB + 08] leads to 03
// wrong: Memory[ BlockBaseC + 09] leads to 08
// correct: Memory[ BlockBaseC + 02] leads to 01

Next loop:
Memory[ BlockBaseA + 05] leads to 09
Memory[ BlockBaseB + 03] leads to 02
// wrong: Memory[ BlockBaseC + 08] leads to 06
// correct: Memory[ BlockBaseC + 01] leads to 09

Next loop:
Memory[ BlockBaseA + 09] leads to 06
Memory[ BlockBaseB + 02] leads to 05
// wrong: Memory[ BlockBaseC + 06] leads to 03
// correct: Memory[ BlockBaseC + 09] leads to 08

Done. 4 loops complete.

06
05
wrong: 03
correct: 08

Are stored in block result.

Bye,
Skybuck.

  #23  
Old August 2nd 11, 12:04 AM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

The redcode program helped me spot one more little error.

The first loop was actually missing, so there will be 5 loops in the example.
I am now going to make the example fully correct:

(Loop count is now 5)

Now for a little example.

Let's assume 3 blocks, each of 10 elements, and a loop count of 5.

The memory is initialized by code which is not provided.

But it could look like this:

(These numbers are all 32 bit integers):

memory indexes:
00 01 02 03 04 05 06 07 08 09
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29

memory contents:
01 07 03 04 08 09 00 05 02 06
09 04 05 02 01 00 06 07 03 08
04 09 01 00 02 05 03 07 06 08

BlockBaseA = 00
BlockBaseB = 10
BlockBaseC = 20

This means that each line of memory contents is a block.

The numbers/indexes all point towards each other/indexes like so:

BlockA: 01-07-03-04-08-09-00-05-02-06
BlockB: 09-04-05-02-01-00-06-07-03-08
BlockC: 04-09-01-00-02-05-03-07-06-08

ElementIndex for A starts at index 0
ElementIndex for B starts at index 0
ElementIndex for C starts at index 0

So
01
09
04

are the first 3 values retrieved for A, B and C.

01 indicates the next index is located at index 01
09 indicates the next index is located at index 09
04 indicates the next index is located at index 04

So performing:

Loop 0:

Memory[ BlockBaseA + 0] leads to 01
Memory[ BlockBaseB + 0] leads to 09
Memory[ BlockBaseC + 0] leads to 04

Next loop 1:

Memory[ BlockBaseA + 01] leads to 07
Memory[ BlockBaseB + 09] leads to 08
Memory[ BlockBaseC + 04] leads to 02

Next loop 2:

Memory[ BlockBaseA + 07] leads to 05
Memory[ BlockBaseB + 08] leads to 03
Memory[ BlockBaseC + 02] leads to 01

Next loop 3:
Memory[ BlockBaseA + 05] leads to 09
Memory[ BlockBaseB + 03] leads to 02
Memory[ BlockBaseC + 01] leads to 09

Next loop 4:
Memory[ BlockBaseA + 09] leads to 06
Memory[ BlockBaseB + 02] leads to 05
Memory[ BlockBaseC + 09] leads to 08

Done. 5 loops complete.

06
05
08

are stored in block result.
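
The walk-through above can be checked with a few lines of Python. This is only a sketch: the names MEMORY and chase are made up for illustration, and the memory contents are simply the ones from the example.

```python
# Memory contents from the example: 3 blocks of 10 elements each.
MEMORY = [
    1, 7, 3, 4, 8, 9, 0, 5, 2, 6,   # block A, base 0
    9, 4, 5, 2, 1, 0, 6, 7, 3, 8,   # block B, base 10
    4, 9, 1, 0, 2, 5, 3, 7, 6, 8,   # block C, base 20
]


def chase(memory, block_base, loop_count):
    """Follow the index chain inside one block, starting at element 0."""
    element_index = 0
    for _ in range(loop_count):
        # Each element holds the index of the next element to visit.
        element_index = memory[block_base + element_index]
    return element_index


block_result = [chase(MEMORY, base, 5) for base in (0, 10, 20)]
print(block_result)  # [6, 5, 8]
```

With a loop count of 5 this gives the same 06, 05, 08 as the corrected example.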

In my next posting I will post the little redcode program; it's kind of funny
and might help in conceiving an assembler program.

Bye,
Skybuck.

  #24  
Old August 2nd 11, 12:08 AM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

The redcode program which executes the example:

Perhaps this simple redcode program (which uses a pseudo/virtual asm
instruction set) might help in conceiving an x86 assembly program:

;redcode
;name MemoryTest
;author Skybuck Flying
;strategy MemoryTest
;version 001
;date 2 august 2011
;
org Start

Memory
dat 01
dat 07
dat 03
dat 04
dat 08
dat 09
dat 00
dat 05
dat 02
dat 06
dat 09
dat 04
dat 05
dat 02
dat 01
dat 00
dat 06
dat 07
dat 03
dat 08
dat 04
dat 09
dat 01
dat 00
dat 02
dat 05
dat 03
dat 07
dat 06
dat 08

BaseA dat 00
BaseB dat 10
BaseC dat 20

IndexA dat 0
IndexB dat 0
IndexC dat 0

; compensate for relative addressing, store relative address for memory.
LocationA dat Memory, 0
LocationB dat Memory, 0
LocationC dat Memory, 0

Start

FirstLoop
; warning: redcode's mov works opposite of intel x86's mov,
; redcode's mov is: source, dest

; copy memory location to location
mov.ab LocationA, LocationA
mov.ab LocationB, LocationB
mov.ab LocationC, LocationC

; add base to location
add.b BaseA, LocationA
add.b BaseB, LocationB
add.b BaseC, LocationC

; add index to location
add.b IndexA, LocationA
add.b IndexB, LocationB
add.b IndexC, LocationC

; retrieve new index from location
mov.b @LocationA, IndexA
mov.b @LocationB, IndexB
mov.b @LocationC, IndexC

; reduce a counter, repeat 5 times then done.
djn FirstLoop, #5

; show final result by copying final indexes to block result variables.
mov.b IndexA, BlockResultA
mov.b IndexB, BlockResultB
mov.b IndexC, BlockResultC

BlockResultA dat 0
BlockResultB dat 0
BlockResultC dat 0

Bye,
Skybuck.

  #25  
Old August 2nd 11, 06:47 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Roberto Waltman
external usenet poster
 
Posts: 3

wrote:
Bernhard Schornak wrote:
It doesn't really matter which high level language -
Pascal and C(++) compilers generate comparable code,
I guess. No automaton can replace a brain.


Tell that to the pilots and designers of "stability augmented"
aircraft.

Assembler
is the only choice for effective optimisation.


Nice try, but no banana - your guess is wrong. Currently, the
language that optimises best is still Fortran - both C and C++
more-or-less forbid it, and I doubt that any Pascal compilers are
seriously maintained for performance any longer.


In addition to the optimization issues, (made more difficult with
multiple functional units, "out of order", etc.) portability and
reliability concerns are also against assembly code.

Regarding the latter:

"I was always of the opinion that assembler was to reliable code as
smoking is to good health - you're not serious about the latter until
you give up the former."

(Andy Goldstein, in the x-plane-tech mailing list)

--
Roberto Waltman

[ Please reply to the group.
Return address is invalid ]
  #26  
Old August 3rd 11, 09:52 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak
external usenet poster
 
Posts: 17

Skybuck Flying wrote:


My initial guess at the generated assembler is:

32 bit mode simply does not have enough registers available to "pipeline" in parallel.

Perhaps 64 bit mode has enough registers available, however instructions in 64 bit mode
are twice as slow but maybe it might still be faster than 32 bit mode.



"
64 bit instructions have the same latencies as the
corresponding 32 bit instructions (except 64 bit MUL
and DIV). Using EBP as general purpose register (GP)
frees one register at no costs (except you -have to-
use MOV instead of PUSH and POP).
"

Not on my AMD X2 3800+ processor. It's a fake 64 bit processor



If it is an Athlon (the X2 probably means two cores?),
it should be 64 bit. Depending on family and stepping,
more (or less) XMM capabilities are provided.


I think it executes 64 bit instructions as two 32 bit instructions or worse

So the 64 bit instructions have clock cycles/latency of at least 2



If I look at the code you posted, it cannot run faster
than the 32 bit version.


I am not sure about newer processors but I would expect them to be faster but I do also
expect general suckage



My 64 bit Windows 7 is fully usable in less than 30 s,
regardless if it was powered down or rebooted.


So it could be interesting to turn the Delphi code into C/C++ code and then try on 64 bit
compiler.

However I am not really interested in C/C++ code because it's a hell lot of work to
convert all Delphi code to C/C++ code so not going to do that.


"
It doesn't really matter which high level language -
Pascal and C(++) compilers generate comparable code,
I guess. No automaton can replace a brain. Assembler
is the only choice for effective optimisation.
"

Mhoaw I am not so sure about that... when it comes to register re-use the compiler might
spot something more easily than a human being in all that code



A look at the posted code tells us the opposite...


I would expect a C/C++ compiler to be slightly faster, especially the one from Microsoft,
especially in "release mode".

Also Microsoft has a 64 bit compiler for a while now...



I use AS (part of the minGW-64 compiler suite).


But perhaps soon as 64 bit delphi compiler will be out... I think there is already a
preview compiler somewhere.

This also leaves free pascal compiler as a possible try, which has 64 bit compiling as
well, last time I tried it it wasn't so great, but maybe it has improved but I wouldn't
hold my breath

None the less it's interesting to simply do a free pascal compile for 64 bit mode it
shouldn't be that hard to do I guess, so I am gonna give that a try and then see if I can
get at the assembler to see what it generates in 64 bit mode


"
Probably the same with less workarounds. 64 bit code
gains most of its speed due to parameter passing via
registers. Other optimisations and improvements were
possible, but: The 64 bit code I have seen until now
still looked like its older 32 bit brethren.
"

Yeah I have heard the same thing, the extra registers give it some more speed.

"
Having a short look at your 32 and 64 bit sources: I
have to translate them into something human readable
before I can start to figure out, what they actually
do. This might take a while (I am working from 06:00
until everything is done, leaving not much more than
one or two hours for anything else), but I'm sure it
is possible to make those loops (at least!) twice as
fast with some better code. I'll post some code 'til
Saturday or Sunday.
"

Well I am interested, but I doubt you can do it =D

But maybe I underestimate you ! =D



Maybe...


snip

"
Do you really need a result in seconds? Recent CPUs,
be it AMD or LETNi, change the frequency of a single
core if required. Busy cores run at higher frequency
while the frequency of idle cores is slowed down for
that time. I doubt the returned value is accurate in
this case. On my Phenom II 1100T, frequency can vary
between a few hundred and 3700 MHz (no overclocking)
per core, while the processor speed is 3300 MHz (and
this probably is the "frequency" reported by the API
function "QueryPerformanceFrequency").
"

I think it's accurate enough; the timing code could be varied with other timers just in case.



Read the description at MS knowledgebase. My guess hit
the nail's head.
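
For reference, a counter reading only becomes seconds after dividing the tick difference by the counter frequency. The sketch below assumes a hypothetical helper elapsed_seconds and uses Python's time.perf_counter_ns (nanosecond ticks, so 1,000,000,000 ticks per second) as a stand-in for the QueryPerformanceCounter / QueryPerformanceFrequency pair; it is not the Delphi timing code from this thread.

```python
import time


def elapsed_seconds(start_ticks, end_ticks, ticks_per_second):
    """Convert a pair of counter readings into seconds."""
    return (end_ticks - start_ticks) / ticks_per_second


# Stand-in for QueryPerformanceCounter: a monotonic nanosecond counter.
start = time.perf_counter_ns()
total = sum(range(100_000))          # some work to time
end = time.perf_counter_ns()

print(elapsed_seconds(start, end, 1_000_000_000))
```

The result is only as trustworthy as the assumption that the reported frequency matches the rate at which the counter really ticks.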


One more question:

Which parameters do you pass to the function? How does
the code get the address of the memory block it should
process?


First improvement:

This sequence

movsxd rax,r14d
shl rax,2
mov dword ptr [rdx+rax],r9d

is equal to the single instruction

mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous-
clock cycles per iteration. 6 * 80,000 = 480,000 saved
cycles.

Do you need positive and negative indices? If not, six
more clocks per iteration can be saved. And this is no
optimization - it's just a correction of flaws...
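
Why the sign question matters can be shown with a small model of the two address calculations. This is a sketch under assumptions: the helper names are invented, registers are modelled as plain integers, and the upper half of r14 is assumed to be zero (which is what a write to r14d leaves behind).

```python
MASK64 = (1 << 64) - 1


def sign_extend_32(value):
    """Model movsxd: sign-extend the low 32 bits to a 64-bit value."""
    value &= 0xFFFFFFFF
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value


def address_three_instructions(rdx, r14):
    """movsxd rax,r14d / shl rax,2 / address = rdx+rax."""
    rax = sign_extend_32(r14)
    rax = (rax << 2) & MASK64
    return (rdx + rax) & MASK64


def address_scaled_index(rdx, r14):
    """Address formed by [rdx+r14*4], using the full 64-bit r14."""
    return (rdx + r14 * 4) & MASK64


base = 0x10000
for index in (0, 1, 7, 79_999):
    # Non-negative indices: both forms produce the same address.
    assert address_three_instructions(base, index) == address_scaled_index(base, index)

# 0xFFFFFFFF in r14d means -1 when sign-extended, but 4294967295 when
# the upper half of r14 is zero, so the two forms disagree here.
negative = 0xFFFFFFFF
same = address_three_instructions(base, negative) == address_scaled_index(base, negative)
print(same)  # False

saved_cycles = 6 * 80_000
print(saved_cycles)  # 480000
```

So the single-instruction form is a safe replacement only while the indices are never negative.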


Greetings from Augsburg

Bernhard Schornak
  #27  
Old August 4th 11, 12:45 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

"
One more question:

Which parameters do you pass to the function? How does
the code get the address of the memory block it should
process?
"

No parameters are passed.

The routine is part of an "object".

So it gets the address of fields which are inside the object.

If this is a problem for you then you can change the routine so it accepts
parameters.

For example:

procedure MyRoutine( Memory : pointer );

This would pass the memory pointer in eax if I remember correctly

"
First improvement:

This sequence

movsxd rax,r14d
shl rax,2
mov dword ptr [rdx+rax],r9d

is equal to the single instruction

mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous-
clock cycles per iteration. 6 * 80,000 = 480,000 saved
cycles.
"

Hmm I'll have to check this out some more... for which of my code version
was this ?

"
Do you need positive and negative indices? If not, six
more clocks per iteration can be saved. And this is no
optimization - it's just a correction of flaws...
"

Only positive in the current example, probably real world problem too.

I could try changing the type from integer (signed) to
longword(unsigned/positive only) to see if that helps the compiler.

Bye,
Skybuck.

  #28  
Old August 4th 11, 08:28 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak

Skybuck Flying wrote:


"
One more question:

Which parameters do you pass to the function? How does
the code get the address of the memory block it should
process?
"

No parameters are passed.

The routine is part of an "object".

So it gets the address of fields which are inside the object.

If this is a problem for you then you can change the routine so it accepts parameters.



It is not, but your 64 bit code retrieves one parameter
(passed in RCX). It looks like it is an address, 'cause
it is used to access memory locations relative to RCX:


sub rsp,264
mov qword ptr [rsp+120],rbx
mov qword ptr [rsp+128],rdi
mov qword ptr [rsp+136],rsi
mov qword ptr [rsp+144],r12
mov qword ptr [rsp+152],r13
mov qword ptr [rsp+160],r14
mov qword ptr [rsp+168],r15
mov qword ptr [rsp+112],rcx ; RCX is stored at 112[RSP]

lea rcx,qword ptr [rsp+64]
call QueryPerformanceCounter

mov rax,qword ptr [rsp+112] ; RAX is loaded with the stored content of RCX
mov eax,dword ptr [rax+24] ; memory at 24[RAX] is accessed


If RCX didn't hold an address, the last line would
definitely crash sooner or later. Probably, your compiler
passes the base address of that array behind your back...


For example:

procedure MyRoutine( Memory : pointer );

This would pass the memory pointer in eax if I remember correctly



In 32 bit code, parameters are usually passed on the stack
(cdecl, stdcall), although Delphi's default "register"
convention does pass the first three in EAX, EDX and ECX.
In 64 bit Windows, the first four parameters are passed
in RCX, RDX, R8 and R9, or XMM0...XMM3 for FP values.
Remaining parameters are passed on the stack at 0x20[RSP]
and up. The area 0x00 ... 0x20[RSP] is called "shadow
space" (or "home space"). It is reserved for the called
function.


"
First improvement:

This sequence

movsxd rax,r14d
shl rax,2
mov dword ptr [rdx+rax],r9d

is equal to the single instruction

mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous-
clock cycles per iteration. 6 * 80,000 = 480,000 saved
cycles.
"

Hmm I'll have to check this out some more... for which of my code version was this ?



R14 is a 64 bit register (only available in long mode).


"
Do you need positive and negative indices? If not, six
more clocks per iteration can be saved. And this is no
optimization - it's just a correction of flaws...
"

Only positive in the current example, probably real world problem too.

I could try changing the type from integer (signed) to longword(unsigned/positive only) to
see if that helps the compiler.



It surely avoids many "workarounds" like this one

movsxd rax,r14d
add rax,3
mov r14d,eax

which can be reduced to

add r14d,3

One instead of three clock cycles in time critical code
is an improvement.
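
That the three instructions and the single add leave the same value in r14d can be checked with a small model. This is a sketch only: the function names are invented and registers are modelled as plain integers.

```python
MASK32 = 0xFFFFFFFF
MASK64 = (1 << 64) - 1


def sign_extend_32(value):
    """Model movsxd: sign-extend the low 32 bits to a 64-bit value."""
    value &= MASK32
    if value & 0x80000000:
        value |= 0xFFFFFFFF00000000
    return value


def add3_three_instructions(r14d):
    rax = sign_extend_32(r14d)      # movsxd rax,r14d
    rax = (rax + 3) & MASK64        # add rax,3
    return rax & MASK32             # mov r14d,eax keeps the low 32 bits


def add3_single(r14d):
    return (r14d + 3) & MASK32      # add r14d,3


samples = (0, 1, 79_999, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFE, 0xFFFFFFFF)
results = [add3_three_instructions(v) == add3_single(v) for v in samples]
print(all(results))  # True
```

The sign extension never reaches the result, because only the low 32 bits are written back.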

I'm going to translate the 64 bit code tomorrow evening
(partially done, but unfinished). I love weekends...


Greetings from Augsburg

Bernhard Schornak
  #29  
Old August 5th 11, 01:28 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Here is the new Delphi/Pascal code, it uses longwords which are 32 bit
unsigned integers.

Some code has been added to process the remaining blocks if any.

It's indeed now about as fast as the 32 bit version.

However it would be more interesting to test what happens when 64 bit
elements are used.

So I will make another version later on, which would be more interesting if
you could optimize that instead of the longword version.

But for now here is the longword version and its assembly listing:

// *** Begin of Delphi/Pascal 32 bit code ***

// paired version
// version 0.03, optimized inner loop and easy to use local
variables/registers.
// version 0.04, timing code moved outside of routine, types changed to
longword.
// this code assumes mBlockCount is at least 4.
// this code assumes vLoopCount is at least 1.
// code further corrected so remaining blocks are processed as well.
procedure TCPUMemoryTest.ExecuteCPU;
var
vLoopIndex : longword;

vBlockIndexA : longword;
vBlockIndexB : longword;
vBlockIndexC : longword;

vElementIndexA : longword;
vElementIndexB : longword;
vElementIndexC : longword;

vElementCount : longword;
vBlockCount : longword;
vLoopCount : longword;

vBlockBaseA : longword;
vBlockBaseB : longword;
vBlockBaseC : longword;
begin
vElementCount := mElementCount;
vBlockCount := mBlockCount;
vLoopCount := mLoopCount;

vBlockIndexA := 0;
vBlockIndexB := 1;
vBlockIndexC := 2;
while vBlockIndexA = (vBlockCount-4) do
begin
vBlockBaseA := vBlockIndexA * vElementCount;
vBlockBaseB := vBlockIndexB * vElementCount;
vBlockBaseC := vBlockIndexC * vElementCount;

vElementIndexA := 0;
vElementIndexB := 0;
vElementIndexC := 0;

for vLoopIndex := 0 to vLoopCount-1 do
begin
vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
end;

mBlockResult[ vBlockIndexA ] := vElementIndexA;
mBlockResult[ vBlockIndexB ] := vElementIndexB;
mBlockResult[ vBlockIndexC ] := vElementIndexC;

vBlockIndexA := vBlockIndexA + 3;
vBlockIndexB := vBlockIndexB + 3;
vBlockIndexC := vBlockIndexC + 3;
end;

while vBlockIndexA = (vBlockCount-1) do
begin
vBlockBaseA := vBlockIndexA * vElementCount;

vElementIndexA := 0;

for vLoopIndex := 0 to vLoopCount-1 do
begin
vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
end;

mBlockResult[ vBlockIndexA ] := vElementIndexA;

vBlockIndexA := vBlockIndexA + 1;
end;
end;

// *** End of Delphi/Pascal 32 bit code ***
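
The structure of the routine above (three chains per pass, then a remainder loop) can be sketched in Python to check that every block gets a result. Assumptions: the loop conditions are "<=" (which is what the jge instructions in the generated listing suggest), the function name execute_cpu is invented, and the fourth block below is made up so that the remainder loop has something to do.

```python
def execute_cpu(memory, element_count, block_count, loop_count):
    """Three interleaved index chains per pass, then one block at a time."""
    block_result = [0] * block_count
    a = 0
    # Main loop: same bound as the Pascal code (a <= block_count - 4).
    while a <= block_count - 4:
        base_a = a * element_count
        base_b = (a + 1) * element_count
        base_c = (a + 2) * element_count
        ia = ib = ic = 0
        for _ in range(loop_count):
            # The three loads do not depend on each other.
            ia = memory[base_a + ia]
            ib = memory[base_b + ib]
            ic = memory[base_c + ic]
        block_result[a] = ia
        block_result[a + 1] = ib
        block_result[a + 2] = ic
        a += 3
    # Remainder loop: whatever blocks are left.
    while a <= block_count - 1:
        base_a = a * element_count
        ia = 0
        for _ in range(loop_count):
            ia = memory[base_a + ia]
        block_result[a] = ia
        a += 1
    return block_result


memory = [
    1, 7, 3, 4, 8, 9, 0, 5, 2, 6,   # block A from the earlier example
    9, 4, 5, 2, 1, 0, 6, 7, 3, 8,   # block B
    4, 9, 1, 0, 2, 5, 3, 7, 6, 8,   # block C
    1, 2, 3, 4, 5, 6, 7, 8, 9, 0,   # block D (made up for this sketch)
]
print(execute_cpu(memory, 10, 4, 5))  # [6, 5, 8, 5]
```

Python integers are signed, so a block count below 4 simply skips the main loop here; with longword the subtraction would wrap around, which is why the Pascal code states its assumption about mBlockCount.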

// *** Begin of Free Pascal 64 bit output for 32 bit example ***

_TEXT SEGMENT
ALIGN 16
PUBLIC UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__ EXECUTECPU
UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__ EXECUTECPU:
; Temps allocated between rsp+32 and rsp+120
; [360] begin
sub rsp,168
; Var $self located in register r15
; Var vLoopIndex located in register eax
; Var vBlockIndexA located in register r14d
; Var vBlockIndexB located in register ecx
; Var vBlockIndexC located in register ebx
; Var vElementIndexA located in register r9d
; Var vElementIndexB located in register r10d
; Var vElementIndexC located in register r11d
; Var vElementCount located in register eax
; Var vBlockCount located in register eax
; Var vLoopCount located in register eax
; Var vBlockBaseA located in register esi
; Var vBlockBaseB located in register edi
; Var vBlockBaseC located in register r8d
mov qword ptr [rsp+64],rbx
mov qword ptr [rsp+72],rdi
mov qword ptr [rsp+80],rsi
mov qword ptr [rsp+88],r12
mov qword ptr [rsp+96],r13
mov qword ptr [rsp+104],r14
mov qword ptr [rsp+112],r15
mov r15,rcx
; [361] vElementCount := mElementCount;
mov eax,dword ptr [r15+24]
mov qword ptr [rsp+40],rax
; [362] vBlockCount := mBlockCount;
mov eax,dword ptr [r15+28]
mov qword ptr [rsp+56],rax
; [363] vLoopCount := mLoopCount;
mov eax,dword ptr [r15+32]
mov qword ptr [rsp+32],rax
; [365] vBlockIndexA := 0;
mov r14d,0
; [366] vBlockIndexB := 1;
mov ecx,1
; [367] vBlockIndexC := 2;
mov ebx,2
; [368] while vBlockIndexA <= (vBlockCount-4) do
jmp @@j146
ALIGN 8
@@j145:
; [370] vBlockBaseA := vBlockIndexA * vElementCount;
mov r12d,r14d
and r12d,-1
mov edx,dword ptr [rsp+40]
mov eax,edx
and eax,-1
mul r12
mov esi,eax
; [371] vBlockBaseB := vBlockIndexB * vElementCount;
mov edx,ecx
and edx,-1
mov r12d,dword ptr [rsp+40]
mov eax,r12d
and eax,-1
mul rdx
mov edi,eax
; [372] vBlockBaseC := vBlockIndexC * vElementCount;
mov edx,ebx
and edx,-1
mov r12d,dword ptr [rsp+40]
mov eax,r12d
and eax,-1
mul rdx
mov r8d,eax
; [374] vElementIndexA := 0;
mov r9d,0
; [375] vElementIndexB := 0;
mov r10d,0
; [376] vElementIndexC := 0;
mov r11d,0
; [378] for vLoopIndex := 0 to vLoopCount-1 do
mov edx,dword ptr [rsp+32]
mov eax,edx
and eax,-1
dec rax
mov r12d,eax
mov eax,0
mov qword ptr [rsp+48],rax
mov eax,dword ptr [rsp+48]
cmp r12d,eax
jb @@j161
mov eax,dword ptr [rsp+48]
dec eax
mov qword ptr [rsp+48],rax
ALIGN 8
@@j162:
mov eax,dword ptr [rsp+48]
inc eax
mov qword ptr [rsp+48],rax
; [380] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
mov r13,qword ptr [r15+8]
mov eax,esi
and eax,-1
mov edx,r9d
and edx,-1
add rax,rdx
shl rax,2
mov r9d,dword ptr [r13+rax]
; [381] vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
mov rax,qword ptr [r15+8]
mov edx,edi
and edx,-1
mov r13d,r10d
and r13d,-1
add rdx,r13
shl rdx,2
mov r10d,dword ptr [rax+rdx]
; [382] vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
mov rax,qword ptr [r15+8]
mov edx,r8d
and edx,-1
mov r13d,r11d
and r13d,-1
add rdx,r13
shl rdx,2
mov r11d,dword ptr [rax+rdx]
mov eax,dword ptr [rsp+48]
cmp r12d,eax
ja @@j162
@@j161:
; [385] mBlockResult[ vBlockIndexA ] := vElementIndexA;
mov rdx,qword ptr [r15+16]
mov eax,r14d
and eax,-1
shl rax,2
mov dword ptr [rdx+rax],r9d
; [386] mBlockResult[ vBlockIndexB ] := vElementIndexB;
mov rdx,qword ptr [r15+16]
mov eax,ecx
and eax,-1
shl rax,2
mov dword ptr [rdx+rax],r10d
; [387] mBlockResult[ vBlockIndexC ] := vElementIndexC;
mov rax,qword ptr [r15+16]
mov edx,ebx
and edx,-1
shl rdx,2
mov dword ptr [rax+rdx],r11d
; [389] vBlockIndexA := vBlockIndexA + 3;
mov eax,r14d
and eax,-1
add rax,3
mov r14d,eax
; [390] vBlockIndexB := vBlockIndexB + 3;
mov eax,ecx
and eax,-1
add rax,3
mov ecx,eax
; [391] vBlockIndexC := vBlockIndexC + 3;
mov eax,ebx
and eax,-1
add rax,3
mov ebx,eax
@@j146:
mov eax,dword ptr [rsp+56]
mov edx,eax
and edx,-1
sub rdx,4
mov eax,r14d
and eax,-1
cmp rdx,rax
jge @@j145
@@j147:
; [394] while vBlockIndexA <= (vBlockCount-1) do
jmp @@j182
ALIGN 8
@@j181:
; [396] vBlockBaseA := vBlockIndexA * vElementCount;
mov ecx,r14d
and ecx,-1
mov edx,dword ptr [rsp+40]
mov eax,edx
and eax,-1
mul rcx
mov esi,eax
; [398] vElementIndexA := 0;
mov r9d,0
; [400] for vLoopIndex := 0 to vLoopCount-1 do
mov eax,dword ptr [rsp+32]
mov edx,eax
and edx,-1
dec rdx
mov eax,0
mov qword ptr [rsp+48],rax
mov eax,dword ptr [rsp+48]
cmp edx,eax
jb @@j189
mov eax,dword ptr [rsp+48]
dec eax
mov qword ptr [rsp+48],rax
ALIGN 8
@@j190:
mov eax,dword ptr [rsp+48]
inc eax
mov qword ptr [rsp+48],rax
; [402] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
mov rbx,qword ptr [r15+8]
mov ecx,esi
and ecx,-1
mov eax,r9d
and eax,-1
add rcx,rax
shl rcx,2
mov r9d,dword ptr [rbx+rcx]
mov eax,dword ptr [rsp+48]
cmp edx,eax
ja @@j190
@@j189:
; [405] mBlockResult[ vBlockIndexA ] := vElementIndexA;
mov rdx,qword ptr [r15+16]
mov eax,r14d
and eax,-1
shl rax,2
mov dword ptr [rdx+rax],r9d
; [407] vBlockIndexA := vBlockIndexA + 1;
mov eax,r14d
and eax,-1
inc rax
mov r14d,eax
@@j182:
mov eax,dword ptr [rsp+56]
mov edx,eax
and edx,-1
dec rdx
mov eax,r14d
and eax,-1
cmp rdx,rax
jge @@j181
@@j183:
; [410] end;
mov rbx,qword ptr [rsp+64]
mov rdi,qword ptr [rsp+72]
mov rsi,qword ptr [rsp+80]
mov r12,qword ptr [rsp+88]
mov r13,qword ptr [rsp+96]
mov r14,qword ptr [rsp+104]
mov r15,qword ptr [rsp+112]
add rsp,168
ret
_TEXT ENDS

// *** End of Free Pascal 64 bit output for 32 bit example ***

Bye,
Skybuck.

  #30  
Old August 5th 11, 01:50 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Also I am not sure if the correct tools are being called. I am using Textpad
4 to invoke Free Pascal Compiler.

After compiling the 64 bit sources it returns:

"
Assembling: O:\FreePascal\Tests\test cpu random memory access
performance\version 0.04 compile with free
pascal\unit_TCPUMemoryTest_version_001.s
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04
compile with free pascal\unit_TCPUMemoryTest_version_001.s(717) : error
A2070:invalid instruction operands
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04
compile with free pascal\unit_TCPUMemoryTest_version_001.s(752) : error
A2070:invalid instruction operands
O:\FreePascal\Tests\test cpu random memory access performance\version 0.04
compile with free pascal\unit_TCPUMemoryTest_version_001.s(753) : error
A2070:invalid instruction operands
Microsoft (R) Macro Assembler (x64) Version 10.00.30319.01
Copyright (C) Microsoft Corporation. All rights reserved.

unit_TCPUMemoryTest_version_001.pas(477) Error: Error while assembling
exitcode 1
unit_TCPUMemoryTest_version_001.pas(477) Fatal: There were 2 errors
compiling module, stopping
Fatal: Compilation aborted

Tool completed with exit code 1
"

^ Microsoft Macro Assembler ?

Anyway...

I modified the sources so it now uses 64 bit unsigned integers.

For some reason the code is now slow again, twice as slow as the 32 bit version.
(Perhaps it's just a slow 64 bit processor, as I suspect. I benchmarked 64
bit integer operations some time ago; there should be a usenet posting of it
somewhere in google newsgroups or so.)

Maybe the loops should use 32 bits instead of 64 bits ?

Anyway here is version 0.05 the 64 bit version:

// *** Begin of Delphi/Pascal 64 bit Code ***

// version 0.05, 64 bit version.
procedure TCPUMemoryTest.ExecuteCPU;
var
  vLoopIndex : uint64;

  vBlockIndexA : uint64;
  vBlockIndexB : uint64;
  vBlockIndexC : uint64;

  vElementIndexA : uint64;
  vElementIndexB : uint64;
  vElementIndexC : uint64;

  vElementCount : uint64;
  vBlockCount : uint64;
  vLoopCount : uint64;

  vBlockBaseA : uint64;
  vBlockBaseB : uint64;
  vBlockBaseC : uint64;
begin
  vElementCount := mElementCount;
  vBlockCount := mBlockCount;
  vLoopCount := mLoopCount;

  // main loop: three blocks per pass.
  vBlockIndexA := 0;
  vBlockIndexB := 1;
  vBlockIndexC := 2;
  while vBlockIndexA <= (vBlockCount-4) do
  begin
    vBlockBaseA := vBlockIndexA * vElementCount;
    vBlockBaseB := vBlockIndexB * vElementCount;
    vBlockBaseC := vBlockIndexC * vElementCount;

    vElementIndexA := 0;
    vElementIndexB := 0;
    vElementIndexC := 0;

    for vLoopIndex := 0 to vLoopCount-1 do
    begin
      vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
      vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
      vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
    end;

    mBlockResult[ vBlockIndexA ] := vElementIndexA;
    mBlockResult[ vBlockIndexB ] := vElementIndexB;
    mBlockResult[ vBlockIndexC ] := vElementIndexC;

    vBlockIndexA := vBlockIndexA + 3;
    vBlockIndexB := vBlockIndexB + 3;
    vBlockIndexC := vBlockIndexC + 3;
  end;

  // remainder loop: process the blocks that are left, one at a time.
  while vBlockIndexA <= (vBlockCount-1) do
  begin
    vBlockBaseA := vBlockIndexA * vElementCount;

    vElementIndexA := 0;

    for vLoopIndex := 0 to vLoopCount-1 do
    begin
      vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
    end;

    mBlockResult[ vBlockIndexA ] := vElementIndexA;

    vBlockIndexA := vBlockIndexA + 1;
  end;
end;

// *** End of Delphi/Pascal 64 bit Code ***
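
To sanity check the procedure, the small three block example from earlier in the thread (3 blocks of 10 elements, loop count 4) can be walked with a few lines of code. The sketch below is a standalone program, not part of TCPUMemoryTest; the names cMemory and ChaseBlock are made up here. It walks each block on its own, which should give the same results as the three-way loop because the blocks never touch each other's elements.

```pascal
program BlockChaseExample;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}

const
  cElementCount = 10;
  cBlockCount   = 3;
  cLoopCount    = 4;

  // Memory contents of the three block example, one block per line.
  cMemory : array[0..cBlockCount*cElementCount-1] of UInt64 = (
    1, 7, 3, 4, 8, 9, 0, 5, 2, 6,
    9, 4, 5, 2, 1, 0, 6, 7, 3, 8,
    4, 9, 1, 0, 2, 5, 3, 7, 6, 8 );

// Start at element 0 of the block and follow the stored index aLoopCount times.
function ChaseBlock(aBlockIndex, aLoopCount : UInt64) : UInt64;
var
  vBlockBase : UInt64;
  vLoopIndex : UInt64;
begin
  vBlockBase := aBlockIndex * cElementCount;
  Result := 0;
  for vLoopIndex := 1 to aLoopCount do
    Result := cMemory[ vBlockBase + Result ];
end;

var
  vBlockIndex : UInt64;
begin
  for vBlockIndex := 0 to cBlockCount-1 do
    WriteLn('block ', vBlockIndex, ' result: ', ChaseBlock(vBlockIndex, cLoopCount));
  // Expected:
  //   block 0 result: 9   (0 -> 1 -> 7 -> 5 -> 9)
  //   block 1 result: 2   (0 -> 9 -> 8 -> 3 -> 2)
  //   block 2 result: 9   (0 -> 4 -> 2 -> 1 -> 9)
end.
```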

// *** Begin of Free Pascal 64 bit code for 64 bit example ***

_TEXT SEGMENT
ALIGN 16
PUBLIC UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__EXECUTECPU
UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__EXECUTECPU:
; Temps allocated between rsp+32 and rsp+112
; [390] begin
sub rsp,152
; Var $self located in register r15
; Var vLoopIndex located in register rax
; Var vBlockIndexA located in register rcx
; Var vBlockIndexB located in register rbx
; Var vBlockIndexC located in register rsi
; Var vElementIndexA located in register r10
; Var vElementIndexB located in register r11
; Var vElementIndexC located in register r12
; Var vElementCount located in register rax
; Var vBlockCount located in register rax
; Var vLoopCount located in register r14
; Var vBlockBaseA located in register rdi
; Var vBlockBaseB located in register r8
; Var vBlockBaseC located in register r9
mov qword ptr [rsp+56],rbx
mov qword ptr [rsp+64],rdi
mov qword ptr [rsp+72],rsi
mov qword ptr [rsp+80],r12
mov qword ptr [rsp+88],r13
mov qword ptr [rsp+96],r14
mov qword ptr [rsp+104],r15
mov r15,rcx
; [391] vElementCount := mElementCount;
mov rax,qword ptr [r15+24]
mov qword ptr [rsp+32],rax
; [392] vBlockCount := mBlockCount;
mov rax,qword ptr [r15+32]
mov qword ptr [rsp+48],rax
; [393] vLoopCount := mLoopCount;
mov r14,qword ptr [r15+40]
; [395] vBlockIndexA := 0;
mov rcx,0
; [396] vBlockIndexB := 1;
mov rbx,1
; [397] vBlockIndexC := 2;
mov rsi,2
; [398] while vBlockIndexA <= (vBlockCount-4) do
jmp @@j166
ALIGN 8
@@j165:
; [400] vBlockBaseA := vBlockIndexA * vElementCount;
mov rax,qword ptr [rsp+32]
mul rcx
mov rdi,rax
; [401] vBlockBaseB := vBlockIndexB * vElementCount;
mov rax,qword ptr [rsp+32]
mul rbx
mov r8,rax
; [402] vBlockBaseC := vBlockIndexC * vElementCount;
mov rax,qword ptr [rsp+32]
mul rsi
mov r9,rax
; [404] vElementIndexA := 0;
mov r10,0
; [405] vElementIndexB := 0;
mov r11,0
; [406] vElementIndexC := 0;
mov r12,0
; [408] for vLoopIndex := 0 to vLoopCount-1 do
mov rax,r14
dec rax
mov rdx,rax
mov qword ptr [rsp+40],0
cmp rdx,qword ptr [rsp+40]
jb @@j181
dec qword ptr [rsp+40]
ALIGN 8
@@j182:
inc qword ptr [rsp+40]
; [410] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
mov rax,qword ptr [r15+8]
mov r13,r10
add r13,rdi
shl r13,3
mov r10,qword ptr [rax+r13]
; [411] vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
mov rax,qword ptr [r15+8]
mov r13,r11
add r13,r8
shl r13,3
mov r11,qword ptr [rax+r13]
; [412] vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
mov rax,qword ptr [r15+8]
mov r13,r12
add r13,r9
shl r13,3
mov r12,qword ptr [rax+r13]
cmp rdx,qword ptr [rsp+40]
ja @@j182
@@j181:
; [415] mBlockResult[ vBlockIndexA ] := vElementIndexA;
mov rdx,qword ptr [r15+16]
mov rax,rcx
shl rax,3
mov qword ptr [rdx+rax],r10
; [416] mBlockResult[ vBlockIndexB ] := vElementIndexB;
mov rax,qword ptr [r15+16]
mov rdx,rbx
shl rdx,3
mov qword ptr [rax+rdx],r11
; [417] mBlockResult[ vBlockIndexC ] := vElementIndexC;
mov rdx,qword ptr [r15+16]
mov rax,rsi
shl rax,3
mov qword ptr [rdx+rax],r12
; [419] vBlockIndexA := vBlockIndexA + 3;
mov rax,rcx
add rax,3
mov rcx,rax
; [420] vBlockIndexB := vBlockIndexB + 3;
mov rax,rbx
add rax,3
mov rbx,rax
; [421] vBlockIndexC := vBlockIndexC + 3;
mov rax,rsi
add rax,3
mov rsi,rax
@@j166:
mov rax,qword ptr [rsp+48]
sub rax,4
cmp rax,rcx
jae @@j165
@@j167:
; [424] while vBlockIndexA <= (vBlockCount-1) do
jmp @@j202
ALIGN 8
@@j201:
; [426] vBlockBaseA := vBlockIndexA * vElementCount;
mov rax,qword ptr [rsp+32]
mul rcx
mov rdi,rax
; [428] vElementIndexA := 0;
mov r10,0
; [430] for vLoopIndex := 0 to vLoopCount-1 do
mov rax,r14
dec rax
mov qword ptr [rsp+40],0
cmp rax,qword ptr [rsp+40]
jb @@j209
dec qword ptr [rsp+40]
ALIGN 8
@@j210:
inc qword ptr [rsp+40]
; [432] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
mov rbx,qword ptr [r15+8]
mov rdx,r10
add rdx,rdi
shl rdx,3
mov r10,qword ptr [rbx+rdx]
cmp rax,qword ptr [rsp+40]
ja @@j210
@@j209:
; [435] mBlockResult[ vBlockIndexA ] := vElementIndexA;
mov rdx,qword ptr [r15+16]
mov rax,rcx
shl rax,3
mov qword ptr [rdx+rax],r10
; [437] vBlockIndexA := vBlockIndexA + 1;
mov rax,rcx
inc rax
mov rcx,rax
@@j202:
mov rax,qword ptr [rsp+48]
dec rax
cmp rax,rcx
jae @@j201
@@j203:
; [439] end;
mov rbx,qword ptr [rsp+56]
mov rdi,qword ptr [rsp+64]
mov rsi,qword ptr [rsp+72]
mov r12,qword ptr [rsp+80]
mov r13,qword ptr [rsp+88]
mov r14,qword ptr [rsp+96]
mov r15,qword ptr [rsp+104]
add rsp,152
ret
_TEXT ENDS

// *** End of Free Pascal 64 bit code for 64 bit example ***
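
One thing the listing makes visible: the bound check at @@j166 is a "sub rax,4" followed by an unsigned "jae". If mBlockCount is ever smaller than 4, vBlockCount-4 wraps around to a huge value and the three-way loop would not stop at the last block (unless overflow checking is on, in which case it should raise an error instead). A possible guard is sketched below. The function names are invented for this example and the checks are switched off on purpose to show the wrap.

```pascal
program UnsignedBoundSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$IFDEF MSWINDOWS}{$APPTYPE CONSOLE}{$ENDIF}
{$Q-}{$R-}  // overflow/range checks off, to show the wrap instead of an error

// The condition as written in ExecuteCPU: wraps when aBlockCount < 4.
function OriginalCondition(aBlockIndexA, aBlockCount : UInt64) : Boolean;
begin
  Result := aBlockIndexA <= (aBlockCount - 4);
end;

// Same test with the constant moved to the other side, so nothing goes below zero.
function GuardedCondition(aBlockIndexA, aBlockCount : UInt64) : Boolean;
begin
  Result := (aBlockIndexA + 4) <= aBlockCount;
end;

begin
  // With 3 blocks the original test still passes at index 3, past the last block.
  WriteLn(OriginalCondition(3, 3));   // TRUE
  WriteLn(GuardedCondition(3, 3));    // FALSE

  // With enough blocks both conditions agree.
  WriteLn(OriginalCondition(0, 10));  // TRUE
  WriteLn(GuardedCondition(0, 10));   // TRUE
end.
```

The remainder loop's vBlockCount-1 has the same issue for a block count of 0; writing it as vBlockIndexA < vBlockCount avoids the subtraction.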

Bye,
Skybuck.

 



