
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")



 
 
  #11  
Old August 1st 11, 12:16 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

The Delphi/Pascal code of version 0.02 was a bit messy, redundant and
not optimized for low-level assembler modifications.

Therefore I have created a new version 0.03 which is much more optimized.

The inner loop is kept to a minimum of instructions. (Redundant instructions
are removed/reduced.)

So it seems like a good idea to optimize the high-level Delphi/Pascal code
first; this will make further micro-optimizations easier, I think!

So here is version 0.03 in Delphi/Pascal code first, and below that the
generated assembler code.

I have added a little comment in the assembler code to highlight the
inner loop which needs micro-optimization.

So it turns out there are only 9 assembler instructions which you need to
look at (three per statement, not counting the loop counter's dec/jnz).

I also looked at them, and there seem to be some register dependencies (?):
each load uses the result of the previous load in the same chain as its index.

So perhaps rewriting this assembler code a little bit might give more
performance...

// *** Begin of Delphi/Pascal code ***:

// version 0.03, optimized inner loop and easy to use local variables/registers.
procedure TCPUMemoryTest.ExecuteCPU;
var
  vStart : int64;
  vStop : int64;
  vFrequency : int64;

  // vBlockIndex : integer;
  vLoopIndex : integer;

  vBlockIndexA : integer;
  vBlockIndexB : integer;
  vBlockIndexC : integer;

  vElementIndexA : integer;
  vElementIndexB : integer;
  vElementIndexC : integer;

  vElementCount : integer;
  vBlockCount : integer;
  vLoopCount : integer;

  vBlockBaseA : integer;
  vBlockBaseB : integer;
  vBlockBaseC : integer;
begin
  QueryPerformanceCounter( vStart );

  vElementCount := mElementCount;
  vBlockCount := mBlockCount;
  vLoopCount := mLoopCount;

  vBlockIndexA := 0;
  vBlockIndexB := 1;
  vBlockIndexC := 2;
  while vBlockIndexA <= (vBlockCount-4) do
  begin
    vBlockBaseA := vBlockIndexA * vElementCount;
    vBlockBaseB := vBlockIndexB * vElementCount;
    vBlockBaseC := vBlockIndexC * vElementCount;

    vElementIndexA := 0;
    vElementIndexB := 0;
    vElementIndexC := 0;

    // three independent chains: each load only depends on its own previous load
    for vLoopIndex := 0 to vLoopCount-1 do
    begin
      vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
      vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
      vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
    end;

    mBlockResult[ vBlockIndexA ] := vElementIndexA;
    mBlockResult[ vBlockIndexB ] := vElementIndexB;
    mBlockResult[ vBlockIndexC ] := vElementIndexC;

    vBlockIndexA := vBlockIndexA + 3;
    vBlockIndexB := vBlockIndexB + 3;
    vBlockIndexC := vBlockIndexC + 3;
  end;
  QueryPerformanceCounter( vStop );
  QueryPerformanceFrequency( vFrequency );

  mCPUExecutionTimeInSeconds := (vStop - vStart) / vFrequency;
end;

// *** End of Delphi/Pascal code ***
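
For anyone who wants to try the same loop outside Delphi, here is a minimal sketch in C. It is not the test program from this thread: the function name chase_three_blocks and the parameter names are made up for illustration, and it assumes (as the Pascal code does) that every element of a block holds a valid index into that same block. The loop bound is copied from the generated assembler (vBlockIndexA <= vBlockCount-4), so trailing blocks that do not fit the bound are skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C port of the version 0.03 loop (all names are illustrative).
   memory       : block_count * element_count entries; each entry is assumed
                  to be a valid index (0 .. element_count-1) into its own block
   block_result : block_count entries; receives the final index per block */
void chase_three_blocks(const int32_t *memory, int32_t *block_result,
                        int32_t element_count, int32_t block_count,
                        int32_t loop_count)
{
    assert(memory != NULL && block_result != NULL);
    assert(element_count > 0);

    /* Same bound as the Pascal code: vBlockIndexA <= vBlockCount-4 */
    for (int32_t a = 0; a <= block_count - 4; a += 3) {
        const int32_t *base_a = memory + (size_t)a * (size_t)element_count;
        const int32_t *base_b = base_a + element_count;
        const int32_t *base_c = base_b + element_count;
        int32_t ia = 0, ib = 0, ic = 0;

        /* Three independent chains: each load depends only on the previous
           load of its own chain, so the three loads can overlap. */
        for (int32_t i = 0; i < loop_count; i++) {
            ia = base_a[ia];
            ib = base_b[ib];
            ic = base_c[ic];
        }

        block_result[a]     = ia;
        block_result[a + 1] = ib;
        block_result[a + 2] = ic;
    }
}
```

Keeping the three block bases in pointers and the three indices in locals is the same idea as the local variables in version 0.03: it gives the compiler a fair chance to keep everything in registers inside the inner loop.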


// *** Begin of generated assembler ***:

unit_TCPUMemoryTest_version_001.pas.254: begin
0040FC5C 53 push ebx
0040FC5D 56 push esi
0040FC5E 57 push edi
0040FC5F 83C4B8 add esp,-$48
0040FC62 8BD8 mov ebx,eax
unit_TCPUMemoryTest_version_001.pas.255: QueryPerformanceCounter( vStart );
0040FC64 54 push esp
0040FC65 E82295FFFF call QueryPerformanceCounter
unit_TCPUMemoryTest_version_001.pas.257: vElementCount := mElementCount;
0040FC6A 8B430C mov eax,[ebx+$0c]
0040FC6D 89442428 mov [esp+$28],eax
unit_TCPUMemoryTest_version_001.pas.258: vBlockCount := mBlockCount;
0040FC71 8B4310 mov eax,[ebx+$10]
0040FC74 8944242C mov [esp+$2c],eax
unit_TCPUMemoryTest_version_001.pas.259: vLoopCount := mLoopCount;
0040FC78 8B4314 mov eax,[ebx+$14]
0040FC7B 89442430 mov [esp+$30],eax
unit_TCPUMemoryTest_version_001.pas.261: vBlockIndexA := 0;
0040FC7F 33C0 xor eax,eax
0040FC81 8944241C mov [esp+$1c],eax
unit_TCPUMemoryTest_version_001.pas.262: vBlockIndexB := 1;
0040FC85 C744242001000000 mov [esp+$20],$00000001
unit_TCPUMemoryTest_version_001.pas.263: vBlockIndexC := 2;
0040FC8D C744242402000000 mov [esp+$24],$00000002
0040FC95 E982000000 jmp $0040fd1c
unit_TCPUMemoryTest_version_001.pas.266: vBlockBaseA := vBlockIndexA *
vElementCount;
0040FC9A 8B44241C mov eax,[esp+$1c]
0040FC9E F76C2428 imul dword ptr [esp+$28]
0040FCA2 89442434 mov [esp+$34],eax
unit_TCPUMemoryTest_version_001.pas.267: vBlockBaseB := vBlockIndexB *
vElementCount;
0040FCA6 8B442420 mov eax,[esp+$20]
0040FCAA F76C2428 imul dword ptr [esp+$28]
0040FCAE 89442438 mov [esp+$38],eax
unit_TCPUMemoryTest_version_001.pas.268: vBlockBaseC := vBlockIndexC *
vElementCount;
0040FCB2 8B442424 mov eax,[esp+$24]
0040FCB6 F76C2428 imul dword ptr [esp+$28]
0040FCBA 8944243C mov [esp+$3c],eax
unit_TCPUMemoryTest_version_001.pas.270: vElementIndexA := 0;
0040FCBE 33C0 xor eax,eax
unit_TCPUMemoryTest_version_001.pas.271: vElementIndexB := 0;
0040FCC0 33D2 xor edx,edx
unit_TCPUMemoryTest_version_001.pas.272: vElementIndexC := 0;
0040FCC2 33F6 xor esi,esi
unit_TCPUMemoryTest_version_001.pas.274: for vLoopIndex := 0 to vLoopCount-1
do
0040FCC4 8B7C2430 mov edi,[esp+$30]
0040FCC8 4F dec edi
0040FCC9 85FF test edi,edi
0040FCCB 7C22 jl $0040fcef
0040FCCD 47 inc edi

// *** BEGIN OF INNER LOOP ***:

unit_TCPUMemoryTest_version_001.pas.276: vElementIndexA := mMemory[
vBlockBaseA + vElementIndexA ];
0040FCCE 03442434 add eax,[esp+$34]
0040FCD2 8B4B04 mov ecx,[ebx+$04]
0040FCD5 8B0481 mov eax,[ecx+eax*4]
unit_TCPUMemoryTest_version_001.pas.277: vElementIndexB := mMemory[
vBlockBaseB + vElementIndexB ];
0040FCD8 03542438 add edx,[esp+$38]
0040FCDC 8B4B04 mov ecx,[ebx+$04]
0040FCDF 8B1491 mov edx,[ecx+edx*4]
unit_TCPUMemoryTest_version_001.pas.278: vElementIndexC := mMemory[
vBlockBaseC + vElementIndexC ];
0040FCE2 0374243C add esi,[esp+$3c]
0040FCE6 8B4B04 mov ecx,[ebx+$04]
0040FCE9 8B34B1 mov esi,[ecx+esi*4]
unit_TCPUMemoryTest_version_001.pas.274: for vLoopIndex := 0 to vLoopCount-1
do
0040FCEC 4F dec edi
0040FCED 75DF jnz $0040fcce

// *** END OF INNER LOOP ***

unit_TCPUMemoryTest_version_001.pas.281: mBlockResult[ vBlockIndexA ] :=
vElementIndexA;
0040FCEF 8B4B08 mov ecx,[ebx+$08]
0040FCF2 8B7C241C mov edi,[esp+$1c]
0040FCF6 8904B9 mov [ecx+edi*4],eax
unit_TCPUMemoryTest_version_001.pas.282: mBlockResult[ vBlockIndexB ] :=
vElementIndexB;
0040FCF9 8B4308 mov eax,[ebx+$08]
0040FCFC 8B4C2420 mov ecx,[esp+$20]
0040FD00 891488 mov [eax+ecx*4],edx
unit_TCPUMemoryTest_version_001.pas.283: mBlockResult[ vBlockIndexC ] :=
vElementIndexC;
0040FD03 8B4308 mov eax,[ebx+$08]
0040FD06 8B542424 mov edx,[esp+$24]
0040FD0A 893490 mov [eax+edx*4],esi
unit_TCPUMemoryTest_version_001.pas.285: vBlockIndexA := vBlockIndexA + 3;
0040FD0D 8344241C03 add dword ptr [esp+$1c],$03
unit_TCPUMemoryTest_version_001.pas.286: vBlockIndexB := vBlockIndexB + 3;
0040FD12 8344242003 add dword ptr [esp+$20],$03
unit_TCPUMemoryTest_version_001.pas.287: vBlockIndexC := vBlockIndexC + 3;
0040FD17 8344242403 add dword ptr [esp+$24],$03
unit_TCPUMemoryTest_version_001.pas.264: while vBlockIndexA <= (vBlockCount-4) do
0040FD1C 8B44242C mov eax,[esp+$2c]
0040FD20 83E804 sub eax,$04
0040FD23 3B44241C cmp eax,[esp+$1c]
0040FD27 0F8D6DFFFFFF jnl $0040fc9a
unit_TCPUMemoryTest_version_001.pas.289: QueryPerformanceCounter( vStop );
0040FD2D 8D442408 lea eax,[esp+$08]
0040FD31 50 push eax
0040FD32 E85594FFFF call QueryPerformanceCounter
unit_TCPUMemoryTest_version_001.pas.290: QueryPerformanceFrequency(
vFrequency );
0040FD37 8D442410 lea eax,[esp+$10]
0040FD3B 50 push eax
0040FD3C E85394FFFF call QueryPerformanceFrequency
unit_TCPUMemoryTest_version_001.pas.292: mCPUExecutionTimeInSeconds :=
(vStop - vStart) / vFrequency;
0040FD41 8B442408 mov eax,[esp+$08]
0040FD45 8B54240C mov edx,[esp+$0c]
0040FD49 2B0424 sub eax,[esp]
0040FD4C 1B542404 sbb edx,[esp+$04]
0040FD50 89442440 mov [esp+$40],eax
0040FD54 89542444 mov [esp+$44],edx
0040FD58 DF6C2440 fild qword ptr [esp+$40]
0040FD5C DF6C2410 fild qword ptr [esp+$10]
0040FD60 DEF9 fdivp st(1)
0040FD62 DD5B28 fstp qword ptr [ebx+$28]
0040FD65 9B wait
unit_TCPUMemoryTest_version_001.pas.293: end;
0040FD66 83C448 add esp,$48
0040FD69 5F pop edi
0040FD6A 5E pop esi
0040FD6B 5B pop ebx
0040FD6C C3 ret

// *** End of generated assembler ***

Bye,
Skybuck.


  #12  
Old August 1st 11, 12:25 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

My initial guess at the generated assembler is:

32 bit mode simply does not have enough registers available to "pipeline" in
parallel.

Perhaps 64 bit mode has enough registers available. However, I believe
instructions in 64 bit mode are twice as slow, though it might still end up
faster than 32 bit mode.

So it could be interesting to turn the Delphi code into C/C++ code and then
try it on a 64 bit compiler.

However, I am not really interested in C/C++ code because it's a hell of a lot
of work to convert all the Delphi code to C/C++, so I am not going to do that.

But perhaps a 64 bit Delphi compiler will be out soon... I think there is
already a preview compiler somewhere.

This also leaves the Free Pascal compiler as a possible try, which can
compile for 64 bit as well. Last time I tried it, it wasn't so great; maybe it
has improved, but I wouldn't hold my breath.

Nonetheless, it's interesting to simply do a Free Pascal compile for 64 bit
mode. It shouldn't be that hard to do, I guess, so I am going to give that a
try and then see if I can get at the assembler to see what it generates in 64
bit mode.

Bye,
Skybuck.

  #13  
Old August 1st 11, 12:58 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Anyway, 64 bit theory tested... it seems to be much worse. Perhaps the 64 bit
Free Pascal compiler is not as optimized, or perhaps the 32 bit Delphi
compiler is just much better.

The 64 bit test program produced by the Free Pascal 2.4.4 x64 compiler is
twice as slow as the 32 bit test program produced by Delphi XE.

Also, the 64 bit Free Pascal compiler seems to complain that some produced
instructions are invalid, yet it does seem to compile. The invalid
instructions were not in the following code as far as I could tell, so for
what it's worth, here is the Free Pascal 64 bit assembler output.

It's easy to see that it's much worse: it uses 7 instructions per statement
in the inner loop instead of the 3 instructions which Delphi 32 bit uses.

(The -O3 switch was used for the Free Pascal compiler for optimization level 3,
and -al for an assembly listing with source lines):

_TEXT SEGMENT
ALIGN 16
PUBLIC UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__ EXECUTECPU
UNIT_TCPUMEMORYTEST_VERSION_001_TCPUMEMORYTEST_$__ EXECUTECPU:
; Temps allocated between rsp+88 and rsp+192
; [254] begin
sub rsp,264
; Var $self located in register rax
; Var vLoopIndex located in register eax
; Var vBlockIndexA located in register r14d
; Var vBlockIndexB located in register ecx
; Var vBlockIndexC located in register ebx
; Var vElementIndexA located in register r9d
; Var vElementIndexB located in register r10d
; Var vElementIndexC located in register r11d
; Var vElementCount located in register eax
; Var vBlockCount located in register eax
; Var vLoopCount located in register r15d
; Var vBlockBaseA located in register esi
; Var vBlockBaseB located in register edi
; Var vBlockBaseC located in register r8d
mov qword ptr [rsp+120],rbx
mov qword ptr [rsp+128],rdi
mov qword ptr [rsp+136],rsi
mov qword ptr [rsp+144],r12
mov qword ptr [rsp+152],r13
mov qword ptr [rsp+160],r14
mov qword ptr [rsp+168],r15
; Var vStart located at rsp+64
; Var vStop located at rsp+72
; Var vFrequency located at rsp+80
mov qword ptr [rsp+112],rcx
; [255] QueryPerformanceCounter( vStart );
lea rcx,qword ptr [rsp+64]
call QueryPerformanceCounter
; [257] vElementCount := mElementCount;
mov rax,qword ptr [rsp+112]
mov eax,dword ptr [rax+24]
mov qword ptr [rsp+88],rax
; [258] vBlockCount := mBlockCount;
mov rax,qword ptr [rsp+112]
mov eax,dword ptr [rax+28]
mov qword ptr [rsp+104],rax
; [259] vLoopCount := mLoopCount;
mov rax,qword ptr [rsp+112]
mov r15d,dword ptr [rax+32]
; [261] vBlockIndexA := 0;
mov r14d,0
; [262] vBlockIndexB := 1;
mov ecx,1
; [263] vBlockIndexC := 2;
mov ebx,2
; [264] while vBlockIndexA <= (vBlockCount-4) do
jmp @@j138
ALIGN 8
@@j137:
; [266] vBlockBaseA := vBlockIndexA * vElementCount;
movsxd rax,r14d
movsxd rdx,dword ptr [rsp+88]
imul rax,rdx
mov esi,eax
; [267] vBlockBaseB := vBlockIndexB * vElementCount;
movsxd rax,ecx
movsxd rdx,dword ptr [rsp+88]
imul rax,rdx
mov edi,eax
; [268] vBlockBaseC := vBlockIndexC * vElementCount;
movsxd rax,ebx
movsxd rdx,dword ptr [rsp+88]
imul rax,rdx
mov r8d,eax
; [270] vElementIndexA := 0;
mov r9d,0
; [271] vElementIndexB := 0;
mov r10d,0
; [272] vElementIndexC := 0;
mov r11d,0
; [274] for vLoopIndex := 0 to vLoopCount-1 do
movsxd rax,r15d
dec rax
mov r12d,eax
mov eax,0
mov qword ptr [rsp+96],rax
mov eax,dword ptr [rsp+96]
cmp r12d,eax
jl @@j153
mov eax,dword ptr [rsp+96]
dec eax
mov qword ptr [rsp+96],rax
ALIGN 8
@@j154:
mov eax,dword ptr [rsp+96]
inc eax
mov qword ptr [rsp+96],rax
; [276] vElementIndexA := mMemory[ vBlockBaseA + vElementIndexA ];
mov rax,qword ptr [rsp+112]
mov r13,qword ptr [rax+8]
movsxd rax,esi
movsxd rdx,r9d
add rax,rdx
shl rax,2
mov r9d,dword ptr [r13+rax]
; [277] vElementIndexB := mMemory[ vBlockBaseB + vElementIndexB ];
mov rdx,qword ptr [rsp+112]
mov rax,qword ptr [rdx+8]
movsxd rdx,edi
movsxd r13,r10d
add rdx,r13
shl rdx,2
mov r10d,dword ptr [rax+rdx]
; [278] vElementIndexC := mMemory[ vBlockBaseC + vElementIndexC ];
mov rdx,qword ptr [rsp+112]
mov rax,qword ptr [rdx+8]
movsxd rdx,r8d
movsxd r13,r11d
add rdx,r13
shl rdx,2
mov r11d,dword ptr [rax+rdx]
mov eax,dword ptr [rsp+96]
cmp r12d,eax
jg @@j154
@@j153:
; [281] mBlockResult[ vBlockIndexA ] := vElementIndexA;
mov rax,qword ptr [rsp+112]
mov rdx,qword ptr [rax+16]
movsxd rax,r14d
shl rax,2
mov dword ptr [rdx+rax],r9d
; [282] mBlockResult[ vBlockIndexB ] := vElementIndexB;
mov rax,qword ptr [rsp+112]
mov rdx,qword ptr [rax+16]
movsxd rax,ecx
shl rax,2
mov dword ptr [rdx+rax],r10d
; [283] mBlockResult[ vBlockIndexC ] := vElementIndexC;
mov rax,qword ptr [rsp+112]
mov rdx,qword ptr [rax+16]
movsxd rax,ebx
shl rax,2
mov dword ptr [rdx+rax],r11d
; [285] vBlockIndexA := vBlockIndexA + 3;
movsxd rax,r14d
add rax,3
mov r14d,eax
; [286] vBlockIndexB := vBlockIndexB + 3;
movsxd rax,ecx
add rax,3
mov ecx,eax
; [287] vBlockIndexC := vBlockIndexC + 3;
movsxd rax,ebx
add rax,3
mov ebx,eax
@@j138:
movsxd rax,dword ptr [rsp+104]
sub rax,4
movsxd rdx,r14d
cmp rax,rdx
jge @@j137
@@j139:
; [289] QueryPerformanceCounter( vStop );
lea rcx,qword ptr [rsp+72]
call QueryPerformanceCounter
; [290] QueryPerformanceFrequency( vFrequency );
lea rcx,qword ptr [rsp+80]
call QueryPerformanceFrequency
; [292] mCPUExecutionTimeInSeconds := (vStop - vStart) / vFrequency;
mov rax,qword ptr [rsp+72]
mov rdx,qword ptr [rsp+64]
sub rax,rdx
cvtsi2sd xmm0,rax
cvtsi2sd xmm1,qword ptr [rsp+80]
divsd xmm0,xmm1
mov rdx,qword ptr [rsp+112]
movsd [rdx+56],xmm0
; [293] end;
mov rbx,qword ptr [rsp+120]
mov rdi,qword ptr [rsp+128]
mov rsi,qword ptr [rsp+136]
mov r12,qword ptr [rsp+144]
mov r13,qword ptr [rsp+152]
mov r14,qword ptr [rsp+160]
mov r15,qword ptr [rsp+168]
add rsp,264
ret
_TEXT ENDS

Bye,
Skybuck.

  #14  
Old August 1st 11, 01:00 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

I have also requested early access to take part in the 64 bit Delphi beta
program.

Maybe I will get a nice "present".

Then I can test how the Delphi 64 bit compiler performs compared to the Free
Pascal 64 bit compiler.

Bye,
Skybuck


  #15  
Old August 1st 11, 09:26 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Nick Maclaren

Bernhard Schornak wrote:

"
It doesn't really matter which high level language -
Pascal and C(++) compilers generate comparable code,
I guess. No automaton can replace a brain. Assembler
is the only choice for effective optimisation.
"


Nice try, but no banana - your guess is wrong. Currently, the
language that optimises best is still Fortran - both C and C++
more-or-less forbid it, and I doubt that any Pascal compilers are
seriously maintained for performance any longer.


Regards,
Nick Maclaren.
  #16  
Old August 1st 11, 09:45 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak

Skybuck Flying wrote:

"
My initial guess at the generated assembler is:

32 bit mode simply does not have enough registers available to "pipeline" in parallel.

Perhaps 64 bit mode has enough registers available, however instructions in 64 bit mode
are twice as slow but maybe it might still be faster than 32 bit mode.
"

64 bit instructions have the same latencies as the
corresponding 32 bit instructions (except 64 bit MUL
and DIV). Using EBP as a general purpose (GP) register
frees one register at no cost (except that you
-have to- use MOV instead of PUSH and POP).


"
So it could be interesting to turn the Delphi code into C/C++ code and then try on 64 bit
compiler.

However I am not really interested in C/C++ code because it's a hell lot of work to
convert all Delphi code to C/C++ code so not going to do that.
"



It doesn't really matter which high level language -
Pascal and C(++) compilers generate comparable code,
I guess. No automaton can replace a brain. Assembler
is the only choice for effective optimisation.


"
But perhaps soon as 64 bit delphi compiler will be out... I think there is already a
preview compiler somewhere.

This also leaves free pascal compiler as a possible try, which has 64 bit compiling as
well, last time I tried it it wasn't so great, but maybe it has improved but I wouldn't
hold my breath

None the less it's interesting to simply do a free pascal compile for 64 bit mode it
shouldn't be that hard to do I guess, so I am gonna give that a try and then see if I can
get at the assembler to see what it generates in 64 bit mode
"



Probably the same with fewer workarounds. 64 bit code
gains most of its speed due to parameter passing via
registers. Other optimisations and improvements would
be possible, but the 64 bit code I have seen until now
still looked like its older 32 bit brethren.

Having had a short look at your 32 and 64 bit sources:
I have to translate them into something human readable
before I can start to figure out what they actually
do. This might take a while (I am working from 06:00
until everything is done, leaving not much more than
one or two hours for anything else), but I'm sure it
is possible to make those loops (at least!) twice as
fast with some better code. I'll post some code by
Saturday or Sunday.


A few questions:

What is Pascal's calling convention (which registers
have to be preserved and how are parameters passed)?

Is "QueryPerformanceCounter" an API function doing a
RDTSC (read time stamp counter), writing the EDX:EAX
pair to the passed memory location (qword)?

What does the := operator do?

Do you really need a result in seconds? Recent CPUs,
be it AMD or LETNi, change the frequency of a single
core if required. Busy cores run at higher frequency
while the frequency of idle cores is slowed down for
that time. I doubt the returned value is accurate in
this case. On my Phenom II 1100T, frequency can vary
between a few hundred and 3700 MHz (no overclocking)
per core, while the processor speed is 3300 MHz (and
this probably is the "frequency" reported by the API
function "QueryPerformanceFrequency").
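
For reference, the conversion the test program performs, (vStop - vStart) / vFrequency, can be sketched in C as below. The function name ticks_to_seconds is made up for illustration. On Windows the three inputs would come from QueryPerformanceCounter and QueryPerformanceFrequency; the sketch deliberately does not call them so that it stays portable, and it makes no claim about which hardware clock sits behind the counter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: converts a pair of counter readings into seconds.
   start, stop : counter values read before and after the measured code
   frequency   : counter ticks per second, assumed to be constant */
double ticks_to_seconds(int64_t start, int64_t stop, int64_t frequency)
{
    assert(frequency > 0);
    return (double)(stop - start) / (double)frequency;
}
```

If the result in seconds is in doubt, the raw tick difference (stop - start) can be compared between runs instead; for ranking two versions of the same loop on the same machine that is usually enough.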


Greetings from Augsburg

Bernhard Schornak

  #17  
Old August 1st 11, 10:38 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

My initial guess at the generated assembler is:

32 bit mode simply does not have enough registers available to "pipeline"
in parallel.

Perhaps 64 bit mode has enough registers available, however instructions
in 64 bit mode
are twice as slow but maybe it might still be faster than 32 bit mode.



"
64 bit instructions have the same latencies than the
corresponding 32 bit instructions (except 64 bit MUL
and DIV). Using EBP as general purpose register (GP)
frees one register at no costs (except you -have to-
use MOV instead of PUSH and POP).
"

Not on my AMD X2 3800+ processor. It's a fake 64 bit processor.

I think it executes 64 bit instructions as two 32 bit instructions, or worse.

So the 64 bit instructions have a latency of at least 2 clock cycles.

I am not sure about newer processors. I would expect them to be faster,
but I do also expect general suckage.

So it could be interesting to turn the Delphi code into C/C++ code and
then try on 64 bit
compiler.

However I am not really interested in C/C++ code because it's a hell lot
of work to
convert all Delphi code to C/C++ code so not going to do that.


"
It doesn't really matter which high level language -
Pascal and C(++) compilers generate comparable code,
I guess. No automaton can replace a brain. Assembler
is the only choice for effective optimisation.
"

Mhoaw I am not so sure about that... when it comes to register re-use the
compiler might spot something more easily than a human being in all that
code

I would expect a C/C++ compiler to be slightly faster, especially the one
from Microsoft, especially in "release mode".

Also Microsoft has a 64 bit compiler for a while now...

But perhaps soon as 64 bit delphi compiler will be out... I think there is
already a
preview compiler somewhere.

This also leaves free pascal compiler as a possible try, which has 64 bit
compiling as
well, last time I tried it it wasn't so great, but maybe it has improved
but I wouldn't
hold my breath

None the less it's interesting to simply do a free pascal compile for 64
bit mode it
shouldn't be that hard to do I guess, so I am gonna give that a try and
then see if I can
get at the assembler to see what it generates in 64 bit mode


"
Probably the same with less workarounds. 64 bit code
gains most of its speed due to parameter passing via
registers. Other optimisations and improvements were
possible, but: The 64 bit code I have seen until now
still looked like its older 32 bit brethren.
"

Yeah I have heard the same thing, the extra registers give it some more
speed.

"
Having a short look at your 32 and 64 bit sources: I
have to translate them into something human readable
before I can start to figure out, what they actually
do. This might take a while (I am working from 06:00
until everything is done, leaving not much more than
one or two hours for anything else), but I'm sure it
is possible to make those loops (at least!) twice as
fast with some better code. I'll post some code 'til
Saturday or Sunday.
"

Well I am interested, but I doubt you can do it =D

But maybe I underestimate you ! =D

"
A few questions:

What is Pascal's calling convention (which registers
have to be preserved and how are parameters passed)?
"

I was looking for the same information. It's probably somewhere on my
harddisk.

For now assume all "user mode/application mode" registers can be used.

If you are unsure, simply push everything at the routine start, and pop
everything later on.

"
Is "QueryPerformanceCounter" an API function doing a
RDTSC (read time stamp counter), writing the EDX:EAX
pair to the passed memory location (qword)?
"
The API call does not have to be converted to assembler, so it doesn't
matter, you can safely ignore that.

Only the two loops are interesting.

"
What does the := operator do?
"

It's the assignment operator. For example:

A := B;

It copies B into A.

In Intel syntax that is like mov A, B
(A is the destination)
(B is the source)

"
Do you really need a result in seconds? Recent CPUs,
be it AMD or LETNi, change the frequency of a single
core if required. Busy cores run at higher frequency
while the frequency of idle cores is slowed down for
that time. I doubt the returned value is accurate in
this case. On my Phenom II 1100T, frequency can vary
between a few hundred and 3700 MHz (no overclocking)
per core, while the processor speed is 3300 MHz (and
this probably is the "frequency" reported by the API
function "QueryPerformanceFrequency").
"

I think it's accurate enough; the timing could be verified with other timers
just in case.

But the timing can be ignored for now.

To be able to write assembler in pascal it can be done either as follows:

procedure Routine;
begin

asm
... assembler goes here...;
end;
end;


Or a full assembler routine:

procedure Routine; assembler;
asm
... assembler goes here ...;
end;

What I am looking for is simply "assembler concept code" which tries to stay
a bit realistic.

So focus on the two loops for now.

Try to produce some assembler pseudo code for the loops.

If you can do that, then that would be interesting.

And then maybe me or others can help patch up the code so it actually works
in pascal.

For now some general intel assembler instruction listing would do...

So assembler pseudo code could look like:


mov eax, BlockCount
mov ecx, ElementCount
mov edx, LoopCount
mov ebx, BlockBase

would indicate pseudo code for loading the parameters into registers.

I'll figure out how to do that eventually

also

mov BlockResult, eax

Something like that.

For now I am mostly interested in how you would allocate the limited
number of registers to the loop and the pipelining/parallel code.

Bye,
Skybuck.

  #18  
Old August 1st 11, 10:49 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Here is some information for you about Delphi's assembler.

I think the calling convention can be "register" or simply "stdcall".

This information was copied from a pdf; the copy & paste is a bit messy but might be of some use.

If you have Acrobat Reader then you can read the pdf here:

http://www.win.tue.nl/~wstomv/edu/de...guageGuide.pdf

Scroll down or click in the index: the section is called "inline assembly code (win32 only)".


“
Understanding Assembler Syntax (Win32 Only)

The inline assembler is available only on the Win32 Delphi compiler. The following material describes the elements
of the assembler syntax necessary for proper use.

Assembler Statement Syntax
Labels
Instruction Opcodes
Assembly Directives
Operands


Assembler Statement Syntax

The syntax of an assembly statement is

Label: Prefix Opcode Operand1, Operand2

where Label is a label, Prefix is an assembly prefix opcode (operation code), Opcode is an assembly instruction
opcode or directive, and Operand is an assembly expression. Label and Prefix are optional. Some opcodes take
only one operand, and some take none.

Comments are allowed between assembly statements, but not within them. For example,

MOV AX,1 {Initial value} { OK }
MOV CX,100 {Count} { OK }
MOV {Initial value} AX,1; { Error! }
MOV CX, {Count} 100 { Error! }



Labels

Labels are used in built-in assembly statements as they are in the Delphi language: by writing the label and a colon
before a statement. There is no limit to a label's length. As in Delphi, labels must be declared in a label declaration
part in the block containing the asm statement. The one exception to this rule is local labels.

Local labels are labels that start with an at-sign (@). They consist of an at-sign followed by one or more letters, digits,
underscores, or at-signs. Use of local labels is restricted to asm statements, and the scope of a local label extends
from the asm reserved word to the end of the asm statement that contains it. A local label doesn't have to be declared.



Instruction Opcodes

The built-in assembler supports all of the Intel-documented opcodes for general application use. Note that operating
system privileged instructions may not be supported. Specifically, the following families of instructions are supported:

Pentium family
Pentium Pro and Pentium II
Pentium III
Pentium 4

In addition, the built-in assembler supports the following instruction sets:

AMD 3DNow! (from the AMD K6 onwards)
AMD Enhanced 3DNow! (from the AMD Athlon onwards)

For a complete description of each instruction, refer to your microprocessor documentation.



RET instruction sizing

The RET instruction opcode always generates a near return.

Automatic jump sizing

Unless otherwise directed, the built-in assembler optimizes jump instructions by automatically selecting the shortest,
and therefore most efficient, form of a jump instruction. This automatic jump sizing applies to the unconditional jump
instruction (JMP), and to all conditional jump instructions when the target is a label (not a procedure or function).

For an unconditional jump instruction (JMP), the built-in assembler generates a short jump (one-byte opcode followed
by a one-byte displacement) if the distance to the target label is -128 to 127 bytes. Otherwise it generates a near
jump (one-byte opcode followed by a two-byte displacement).

For a conditional jump instruction, a short jump (one-byte opcode followed by a one-byte displacement) is generated
if the distance to the target label is -128 to 127 bytes. Otherwise, the built-in assembler generates a short jump with
the inverse condition, which jumps over a near jump to the target label (five bytes in total). For example, the assembly
statement

JC Stop

where Stop isn't within reach of a short jump, is converted to a machine code sequence that corresponds to this:

JNC Skip
JMP Stop
Skip:

Jumps to the entry points of procedures and functions are always near.
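
The selection rule quoted above can be written down as a small decision function. This is only a sketch of the rule as the guide states it, not the built-in assembler's actual implementation; the type and function names are made up, and "distance" is assumed to be the signed displacement in bytes that the jump would have to encode (a one-byte signed displacement covers -128 to 127).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the three outcomes described in the guide. */
typedef enum {
    JUMP_SHORT,                    /* opcode + one-byte displacement */
    JUMP_NEAR,                     /* unconditional jump, target out of short range */
    JUMP_INVERTED_SHORT_OVER_NEAR  /* e.g. JNC Skip / JMP Stop / Skip: */
} jump_form;

/* True if the signed distance fits a one-byte displacement. */
bool fits_short_jump(int32_t distance)
{
    return distance >= -128 && distance <= 127;
}

/* Picks the form the guide describes for a jump to a label. */
jump_form choose_jump_form(bool conditional, int32_t distance)
{
    if (fits_short_jump(distance)) {
        return JUMP_SHORT;
    }
    return conditional ? JUMP_INVERTED_SHORT_OVER_NEAR : JUMP_NEAR;
}
```

For the inner loop in this thread the backward jnz is only 0x21 bytes away from its target, so it stays in the short form either way.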



Assembly Directives



The built-in assembler supports three assembly define directives: DB (define byte), DW (define word), and DD (define

double word). Each generates data corresponding to the comma-separated operands that follow the directive.

The DB directive generates a sequence of bytes. Each operand can be a constant expression with a value between

128 and 255, or a character string of any length. Constant expressions generate one byte of code, and strings

generate a sequence of bytes with values corresponding to the ASCII code of each character.

The DW directive generates a sequence of words. Each operand can be a constant expression with a value between

32,768 and 65,535, or an address expression. For an address expression, the built-in assembler generates a near

pointer, a word that contains the offset part of the address.

The DD directive generates a sequence of double words. Each operand can be a constant expression with a value

between -2,147,483,648 and 4,294,967,295, or an address expression. For an address expression, the built-in

assembler generates a far pointer, a word that contains the offset part of the address, followed by a word that contains

the segment part of the address.

The DQ directive defines a quad word for Int64 values.



The data generated by the DB, DW, and DD directives is always stored in the code segment, just like the code

generated by other built-in assembly statements. To generate uninitialized or initialized data in the data segment,

you should use Delphi var or const declarations.

Some examples of DB, DW, and DD directives follow.



asm
  DB 0FFH                      { One byte }
  DB 0,99                      { Two bytes }
  DB 'A'                       { Ord('A') }
  DB 'Hello world...',0DH,0AH  { String followed by CR/LF }
  DB 12,'string'               { Delphi style string }
  DW 0FFFFH                    { One word }
  DW 0,9999                    { Two words }
  DW 'A'                       { Same as DB 'A',0 }
  DW 'BA'                      { Same as DB 'A','B' }
  DW MyVar                     { Offset of MyVar }
  DW MyProc                    { Offset of MyProc }
  DD 0FFFFFFFFH                { One double-word }
  DD 0,999999999               { Two double-words }
  DD 'A'                       { Same as DB 'A',0,0,0 }
  DD 'DCBA'                    { Same as DB 'A','B','C','D' }
  DD MyVar                     { Pointer to MyVar }
  DD MyProc                    { Pointer to MyProc }
end;
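
To see why DW 'BA' comes out the same as DB 'A','B', it helps to look at the little-endian byte order. The following is a Python sketch (not Delphi, and not from the manual) that emits the bytes a DW or DD operand would occupy; the helper names dw and dd are invented here.

```python
import struct


def dw(value: int) -> bytes:
    """Bytes of one DW operand: 16 bits, little-endian."""
    # Masking lets both signed (-32768..-1) and unsigned values through.
    return struct.pack("<H", value & 0xFFFF)


def dd(value: int) -> bytes:
    """Bytes of one DD operand: 32 bits, little-endian."""
    return struct.pack("<I", value & 0xFFFFFFFF)


if __name__ == "__main__":
    # 'BA' has the numeric value 4241H ('B' = 42H, 'A' = 41H).
    print(dw(0x4241))       # low byte first
    print(dd(0x44434241))   # 'DCBA'
```

The low-order byte is stored first, so the last character of the string constant ends up first in memory.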



When an identifier precedes a DB, DW, or DD directive, it causes the declaration of a byte-, word-, or double-word-sized

variable at the location of the directive. For example, the assembler allows the following:



ByteVar DB ?

WordVar DW ?

IntVar DD ?

..

..

..

MOV AL,ByteVar

MOV BX,WordVar

MOV ECX,IntVar



The built-in assembler doesn't support such variable declarations. The only kind of symbol that can be defined in an

inline assembly statement is a label. All variables must be declared using Delphi syntax; the preceding construction

can be replaced by



var



ByteVar: Byte;

WordVar: Word;

IntVar: Integer;

..

..

..

asm

MOV AL,ByteVar

MOV BX,WordVar

MOV ECX,IntVar

end;



SMALL and LARGE can be used to determine the width of a displacement:



MOV EAX, [LARGE $1234]



This instruction generates a 'normal' move with a 32-bit displacement ($00001234).



MOV EAX, [SMALL $1234]



The second instruction will generate a move with an address size override prefix and a 16-bit displacement ($1234).

SMALL can be used to save space. The following example generates an address size override and a 2-byte address

(in total three bytes)



MOV EAX, [SMALL 123]



as opposed to



MOV EAX, [123]



which will generate no address size override and a 4-byte address (in total four bytes).

Two additional directives allow assembly code to access dynamic and virtual methods: VMTOFFSET and

DMTINDEX.

VMTOFFSET retrieves the offset in bytes of the virtual method pointer table entry of the virtual method argument

from the beginning of the virtual method table (VMT). This directive needs a fully specified class name with a method

name as a parameter (for example, TExample.VirtualMethod), or an interface name and an interface method name.

DMTINDEX retrieves the dynamic method table index of the passed dynamic method. This directive also needs a

fully specified class name with a method name as a parameter, for example, TExample.DynamicMethod. To invoke

the dynamic method, call System.@CallDynaInst with the (E)SI register containing the value obtained from

DMTINDEX.



Note: Methods with the message directive are implemented as dynamic methods and can also be called using the

DMTINDEX technique. For example:



TMyClass = class

procedure x; message MYMESSAGE;

end;



The following example uses both DMTINDEX and VMTOFFSET to access dynamic and virtual methods:



program Project2;

type

TExample = class

procedure DynamicMethod; dynamic;

procedure VirtualMethod; virtual;

end;

procedure TExample.DynamicMethod;

begin

end;

procedure TExample.VirtualMethod;

begin

end;

procedure CallDynamicMethod(e: TExample);

asm

// Save ESI register

PUSH ESI

// Instance pointer needs to be in EAX

MOV EAX, e

// DMT entry index needs to be in (E)SI

MOV ESI, DMTINDEX TExample.DynamicMethod

// Now call the method

CALL System.@CallDynaInst

// Restore ESI register

POP ESI

end;

procedure CallVirtualMethod(e: TExample);

asm

// Instance pointer needs to be in EAX

MOV EAX, e

// Retrieve VMT table entry

MOV EDX, [EAX]

// Now call the method at offset VMTOFFSET

CALL DWORD PTR [EDX + VMTOFFSET TExample.VirtualMethod]

end;

var

e: TExample;

begin

e := TExample.Create;

try

CallDynamicMethod(e);

CallVirtualMethod(e);

finally

e.Free;

end;



end.



Operands



Inline assembler operands are expressions that consist of constants, registers, symbols, and operators.

Within operands, the following reserved words have predefined meanings:



Built-in assembler reserved words



AH CL DX ESP mm4 SHL WORD

AL CS EAX FS mm5 SHR xmm0

AND CX EBP GS mm6 SI xmm1

AX DH EBX HIGH mm7 SMALL xmm2

BH DI ECX LARGE MOD SP xmm3

BL DL EDI LOW NOT SS xmm4

BP CL EDX mm0 OFFSET ST xmm5

BX DMTINDEX EIP mm1 OR TBYTE xmm6

BYTE DS ES mm2 PTR TYPE xmm7

CH DWORD ESI mm3 QWORD VMTOFFSET XOR



Reserved words always take precedence over user-defined identifiers. For example,



var

Ch: Char;

..

..

..

asm

MOV CH, 1

end;



loads 1 into the CH register, not into the Ch variable. To access a user-defined symbol with the same name as a

reserved word, you must use the ampersand (&) override operator:



MOV &Ch, 1



It is best to avoid user-defined identifiers with the same names as built-in assembler reserved words.



Assembly Expressions (Win32 Only)



The built-in assembler evaluates all expressions as 32-bit integer values. It doesn't support floating-point and string

values, except string constants. The inline assembler is available only on the Win32 Delphi compiler.

Expressions are built from expression elements and operators, and each expression has an associated expression

class and expression type. This topic covers the following material:

Differences between Delphi and Assembler Expressions

Expression Elements

Expression Classes

Expression Types

Expression Operators



Differences between Delphi and Assembler Expressions



The most important difference between Delphi expressions and built-in assembler expressions is that assembler

expressions must resolve to a constant value. In other words, it must resolve to a value that can be computed at

compile time. For example, given the declarations



const

X = 10;

Y = 20;

var

Z: Integer;



the following is a valid statement.



asm

MOV Z,X+Y

end;



Because both X and Y are constants, the expression X + Y is a convenient way of writing the constant 30, and the

resulting instruction simply moves the value 30 into the variable Z. But if X and Y are variables



var

X, Y: Integer;



the built-in assembler cannot compute the value of X + Y at compile time. In this case, to move the sum of X

and Y into Z you would use



asm

MOV EAX,X

ADD EAX,Y

MOV Z,EAX

end;



In a Delphi expression, a variable reference denotes the contents of the variable. But in an assembler expression,

a variable reference denotes the address of the variable. In Delphi the expression X + 4 (where X is a variable)



means the contents of X plus 4, while to the built-in assembler it means the contents of the word at the address four

bytes higher than the address of X. So, even though you are allowed to write



asm

MOV EAX,X+4

end;



this code doesn't load the value of X plus 4 into EAX; instead, it loads the value of a word stored four bytes beyond

X. The correct way to add 4 to the contents of X is



asm

MOV EAX,X

ADD EAX,4

end;
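
The contents-versus-address distinction can be shown with a toy memory model. This is a Python sketch with made-up addresses and values, only meant to illustrate the two readings of "X + 4"; nothing in it comes from the manual.

```python
# Toy memory: address -> 32-bit value. Addresses and values are invented.
MEMORY = {0x1000: 10, 0x1004: 99}
ADDR_X = 0x1000  # hypothetical address of the variable X


def delphi_x_plus_4() -> int:
    """Delphi reading of X + 4: the contents of X, plus 4."""
    return MEMORY[ADDR_X] + 4


def asm_x_plus_4() -> int:
    """Built-in assembler reading of X+4: the contents at (address of X) + 4."""
    return MEMORY[ADDR_X + 4]


if __name__ == "__main__":
    print(delphi_x_plus_4())  # based on the value stored in X
    print(asm_x_plus_4())     # whatever happens to live 4 bytes after X
```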



Expression Elements



The elements of an expression are constants, registers, and symbols.



Numeric Constants



Numeric constants must be integers, and their values must be between -2,147,483,648 and 4,294,967,295.

By default, numeric constants use decimal notation, but the built-in assembler also supports binary, octal, and

hexadecimal. Binary notation is selected by writing a B after the number, octal notation by writing an O after the

number, and hexadecimal notation by writing an H after the number or a $ before the number.

Numeric constants must start with one of the digits 0 through 9 or the $ character. When you write a hexadecimal

constant using the H suffix, an extra zero is required in front of the number if the first significant digit is one of the

digits A through F. For example, 0BAD4H and $BAD4 are hexadecimal constants, but BAD4H is an identifier because

it starts with a letter.
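
The notation rules above are easy to mirror in a small parser. This is a Python sketch that follows the rules as quoted (B, O, H suffixes, $ prefix, and the leading-digit requirement); the name parse_asm_number is hypothetical and the real assembler may differ in details such as range checks.

```python
def parse_asm_number(text: str) -> int:
    """Parse a numeric constant written in the notation described above."""
    t = text.strip().upper()
    if t.startswith("$"):
        return int(t[1:], 16)          # $BAD4
    if not t[:1].isdigit():
        # BAD4H starts with a letter, so it is an identifier, not a number.
        raise ValueError(f"{text!r} is not a numeric constant")
    if t.endswith("H"):
        return int(t[:-1], 16)         # 0BAD4H
    if t.endswith("O"):
        return int(t[:-1], 8)          # 17O
    if t.endswith("B"):
        return int(t[:-1], 2)          # 101B
    return int(t, 10)                  # plain decimal


if __name__ == "__main__":
    for s in ("0BAD4H", "$BAD4", "101B", "17O", "42"):
        print(s, parse_asm_number(s))
```

The H suffix is checked before B so that a hexadecimal constant such as 0BH is not mistaken for a binary one.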



String Constants



String constants must be enclosed in single or double quotation marks. Two consecutive quotation marks of the

same type as the enclosing quotation marks count as only one character. Here are some examples of string

constants:



'Z'

'Delphi'

'Linux'

"That's all folks"

'"That''s all folks," he said.'

'100'

'"'

"'"



String constants of any length are allowed in DB directives, and cause allocation of a sequence of bytes containing

the ASCII values of the characters in the string. In all other cases, a string constant can be no longer than four

characters and denotes a numeric value which can participate in an expression. The numeric value of a string

constant is calculated as



Ord(Ch1) + Ord(Ch2) shl 8 + Ord(Ch3) shl 16 + Ord(Ch4) shl 24



where Ch1 is the rightmost (last) character and Ch4 is the leftmost (first) character. If the string is shorter than four

characters, the leftmost characters are assumed to be zero. The following table shows string constants and their

numeric values.



String examples and their values



String      Value

'a'         00000061H
'ba'        00006261H
'cba'       00636261H
'dcba'      64636261H
'a '        00006120H
'   a'      20202061H
'a' * 2     000000C2H
'a'-'A'     00000020H
not 'a'     FFFFFF9EH
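
The formula for the numeric value of a string constant can be checked with a few lines of Python. This is a sketch of the formula as quoted, not the assembler's own code, and the function name is made up.

```python
def string_constant_value(s: str) -> int:
    """Numeric value of a 1..4 character string constant.

    The last character is the low-order byte, the first character the
    high-order byte; missing leading characters count as zero.
    """
    if not 1 <= len(s) <= 4:
        raise ValueError("only 1 to 4 characters have a numeric value")
    value = 0
    for ch in s:
        value = (value << 8) | ord(ch)
    return value


if __name__ == "__main__":
    for s in ("a", "ba", "cba", "dcba", "a "):
        print(repr(s), f"{string_constant_value(s):08X}H")
```

For example, 'dcba' gives 64636261H and 'a ' gives 00006120H.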



Registers



The following reserved symbols denote CPU registers in the inline assembler:



CPU registers



32-bit general purpose    EAX EBX ECX EDX    32-bit pointer or index       ESP EBP ESI EDI
16-bit general purpose    AX BX CX DX        16-bit pointer or index       SP BP SI DI
8-bit low registers       AL BL CL DL        16-bit segment registers      CS DS SS ES
                                             32-bit segment registers      FS GS
8-bit high registers      AH BH CH DH        Coprocessor register stack    ST



When an operand consists solely of a register name, it is called a register operand. All registers can be used as

register operands, and some registers can be used in other contexts.

The base registers (BX and BP) and the index registers (SI and DI) can be written within square brackets to indicate

indexing. Valid base/index register combinations are [BX], [BP], [SI], [DI], [BX+SI], [BX+DI], [BP+SI], and [BP+DI].

You can also index with all the 32-bit registers; for example, [EAX+ECX], [ESP], and [ESP+EAX+5].

The segment registers (ES, CS, SS, DS, FS, and GS) are supported, but segments are normally not useful in 32-

bit applications.

The symbol ST denotes the topmost register on the 8087 floating-point register stack. Each of the eight floating-point

registers can be referred to using ST(X), where X is a constant between 0 and 7 indicating the distance from

the top of the register stack.



Symbols



The built-in assembler allows you to access almost all Delphi identifiers in assembly language expressions, including

constants, types, variables, procedures, and functions. In addition, the built-in assembler implements the special

symbol @Result, which corresponds to the Result variable within the body of a function. For example, the function



function Sum(X, Y: Integer): Integer;

begin

Result := X + Y;

end;



could be written in assembly language as



function Sum(X, Y: Integer): Integer; stdcall;

begin

asm

MOV EAX,X

ADD EAX,Y

MOV @Result,EAX

end;

end;



The following symbols cannot be used in asm statements:

Standard procedures and functions (for example, WriteLn and Chr).

String, floating-point, and set constants (except when loading registers).

Labels that aren't declared in the current block.

The @Result symbol outside of functions.

The following table summarizes the kinds of symbol that can be used in asm statements.



Symbols recognized by the built-in assembler



Symbol      Value                                                          Class              Type

Label       Address of label                                               Memory reference   Size of type
Constant    Value of constant                                              Immediate value    0
Type        0                                                              Memory reference   Size of type
Field       Offset of field                                                Memory reference   Size of type
Variable    Address of variable or address of a pointer to the variable    Memory reference   Size of type
Procedure   Address of procedure                                           Memory reference   Size of type
Function    Address of function                                            Memory reference   Size of type
Unit        0                                                              Immediate value    0
@Result     Result variable offset                                         Memory reference   Size of type



With optimizations disabled, local variables (variables declared in procedures and functions) are always allocated

on the stack and accessed relative to EBP, and the value of a local variable symbol is its signed offset from EBP.

The assembler automatically adds [EBP] in references to local variables. For example, given the declaration



var Count: Integer;



within a function or procedure, the instruction



MOV EAX,Count



assembles into MOV EAX,[EBP-4].



The built-in assembler treats var parameters as 32-bit pointers, and the size of a var parameter is always 4. The

syntax for accessing a var parameter is different from that for accessing a value parameter. To access the contents

of a var parameter, you must first load the 32-bit pointer and then access the location it points to. For example,



function Sum(var X, Y: Integer): Integer; stdcall;

begin

asm

MOV EAX,X

MOV EAX,[EAX]

MOV EDX,Y

ADD EAX,[EDX]

MOV @Result,EAX

end;

end;



Identifiers can be qualified within asm statements. For example, given the declarations



type

TPoint = record

X, Y: Integer;

end;

TRect = record

A, B: TPoint;

end;

var

P: TPoint;

R: TRect;



the following constructions can be used in an asm statement to access fields.



MOV EAX,P.X

MOV EDX,P.Y

MOV ECX,R.A.X

MOV EBX,R.B.Y



A type identifier can be used to construct variables on the fly. Each of the following instructions generates the same

machine code, which loads the contents of [EDX] into EAX.



MOV EAX,(TRect PTR [EDX]).B.X

MOV EAX,TRect([EDX]).B.X

MOV EAX,TRect[EDX].B.X

MOV EAX,[EDX].TRect.B.X



Expression Classes



The built-in assembler divides expressions into three classes: registers, memory references, and immediate values.

An expression that consists solely of a register name is a register expression. Examples of register expressions are

AX, CL, DI, and ES. Used as operands, register expressions direct the assembler to generate instructions that

operate on the CPU registers.

Expressions that denote memory locations are memory references. Delphi's labels, variables, typed constants,

procedures, and functions belong to this category.



Expressions that aren't registers and aren't associated with memory locations are immediate values. This group

includes Delphi's untyped constants and type identifiers.

Immediate values and memory references cause different code to be generated when used as operands. For

example,



const

Start = 10;

var

Count: Integer;

..

..

..

asm

MOV EAX,Start { MOV EAX,xxxx }

MOV EBX,Count { MOV EBX,[xxxx] }

MOV ECX,[Start] { MOV ECX,[xxxx] }

MOV EDX,OFFSET Count { MOV EDX,xxxx }

end;



Because Start is an immediate value, the first MOV is assembled into a move immediate instruction. The second

MOV, however, is translated into a move memory instruction, as Count is a memory reference. In the third MOV,

the brackets convert Start into a memory reference (in this case, the word at offset 10 in the data segment). In the

fourth MOV, the OFFSET operator converts Count into an immediate value (the offset of Count in the data segment).

The brackets and OFFSET operator complement each other. The following asm statement produces identical

machine code to the first two lines of the previous asm statement.



asm

MOV EAX,OFFSET [Start]

MOV EBX,[OFFSET Count]

end;



Memory references and immediate values are further classified as either relocatable or absolute. Relocation is the

process by which the linker assigns absolute addresses to symbols. A relocatable expression denotes a value that

requires relocation at link time, while an absolute expression denotes a value that requires no such relocation.

Typically, expressions that refer to labels, variables, procedures, or functions are relocatable, since the final address

of these symbols is unknown at compile time. Expressions that operate solely on constants are absolute.

The built-in assembler allows you to carry out any operation on an absolute value, but it restricts operations on

relocatable values to addition and subtraction of constants.



Expression Types



Every built-in assembler expression has a type or, more correctly, a size, because the assembler regards the type

of an expression simply as the size of its memory location. For example, the type of an Integer variable is four,

because it occupies 4 bytes. The built-in assembler performs type checking whenever possible, so in the instructions



var

QuitFlag: Boolean;

OutBufPtr: Word;

..

..

..

asm



MOV AL,QuitFlag

MOV BX,OutBufPtr

end;



the assembler checks that the size of QuitFlag is one (a byte), and that the size of OutBufPtr is two (a word).

The instruction



MOV DL,OutBufPtr



produces an error because DL is a byte-sized register and OutBufPtr is a word. The type of a memory reference

can be changed through a typecast; these are correct ways of writing the previous instruction:



MOV DL,BYTE PTR OutBufPtr

MOV DL,Byte(OutBufPtr)

MOV DL,OutBufPtr.Byte



These MOV instructions all refer to the first (least significant) byte of the OutBufPtr variable.
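
That the typecast picks the least significant byte follows from x86 being little-endian: the low byte of a word is stored at the lowest address. A quick Python check of that byte order (a sketch, with an arbitrary example value standing in for OutBufPtr):

```python
import struct


def first_byte(word_value: int) -> int:
    """Byte found at the variable's own address for a 16-bit little-endian word."""
    return struct.pack("<H", word_value)[0]


if __name__ == "__main__":
    out_buf_ptr = 0x1234  # arbitrary example value
    print(hex(first_byte(out_buf_ptr)))
```

With the value 1234H in the word, the byte at the variable's address is 34H, which is what a byte-sized MOV from that address would load.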

In some cases, a memory reference is untyped. One example is an immediate value (Buffer) enclosed in square

brackets:



procedure Example(var Buffer);

asm

MOV AL, [Buffer]

MOV CX, [Buffer]

MOV EDX, [Buffer]

end;



The built-in assembler permits these instructions, because the expression [Buffer] has no type; it just means "the

contents of the location indicated by Buffer," and the type can be determined from the first operand (byte for AL,

word for CX, and double-word for EDX).

In cases where the type can't be determined from another operand, the built-in assembler requires an explicit

typecast. For example,



INC BYTE PTR [ECX]

IMUL WORD PTR [EDX]



The following table summarizes the predefined type symbols that the built-in assembler provides in addition to any

currently declared Delphi types.



Predefined type symbols



Symbol Type



BYTE 1

WORD 2

DWORD 4

QWORD 8

TBYTE 10



Expression Operators



The built-in assembler provides a variety of operators. Precedence rules are different from those of the Delphi

language; for example, in an asm statement, AND has lower precedence than the addition and subtraction operators.

The following table lists the built-in assembler's expression operators in decreasing order of precedence.



Precedence of built-in assembler expression operators



Operators                                       Remarks           Precedence

&                                                                 highest
(... ), [... ], ., HIGH, LOW
+, -                                            unary + and -
:
OFFSET, TYPE, PTR, *, /, MOD, SHL, SHR, +, -    binary + and -
NOT, AND, OR, XOR                                                 lowest
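
The precedence difference matters for constant expressions that mix AND with + or -. In Delphi, and sits at the multiplication level, so 4 + 2 and 3 is read as 4 + (2 and 3); in an asm statement the addition is done first. A Python sketch that spells out both readings (the function names are made up):

```python
def asm_reading(a: int, b: int, c: int) -> int:
    """a + b AND c inside an asm statement: AND binds less tightly than +."""
    return (a + b) & c


def delphi_reading(a: int, b: int, c: int) -> int:
    """a + b and c in ordinary Delphi code: 'and' binds more tightly than +."""
    return a + (b & c)


if __name__ == "__main__":
    print(asm_reading(4, 2, 3), delphi_reading(4, 2, 3))
```

When in doubt, parentheses remove the ambiguity in both languages.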



The following table defines the built-in assembler's expression operators.



Definitions of built-in assembler expression operators



Operator Description



& Identifier override. The identifier immediately following the ampersand is treated as a user-defined symbol, even

if the spelling is the same as a built-in assembler reserved symbol.



(... ) Subexpression. Expressions within parentheses are evaluated completely prior to being treated as a single

expression element. Another expression can precede the expression within the parentheses; the result in this case

is the sum of the values of the two expressions, with the type of the first expression.



[... ] Memory reference. The expression within brackets is evaluated completely prior to being treated as a single

expression element. Another expression can precede the expression within the brackets; the result in this case is

the sum of the values of the two expressions, with the type of the first expression. The result is always a memory

reference.



. Structure member selector. The result is the sum of the expression before the period and the expression after

the period, with the type of the expression after the period. Symbols belonging to the scope identified by the

expression before the period can be accessed in the expression after the period.

HIGH Returns the high-order 8 bits of the word-sized expression following the operator. The expression must be an

absolute immediate value.

LOW Returns the low-order 8 bits of the word-sized expression following the operator. The expression must be an

absolute immediate value.



+ Unary plus. Returns the expression following the plus with no changes. The expression must be an absolute

immediate value.



- Unary minus. Returns the negated value of the expression following the minus. The expression must be an absolute

immediate value.



+ Addition. The expressions can be immediate values or memory references, but only one of the expressions can

be a relocatable value. If one of the expressions is a relocatable value, the result is also a relocatable value. If either

of the expressions is a memory reference, the result is also a memory reference.



- Subtraction. The first expression can have any class, but the second expression must be an absolute immediate

value. The result has the same class as the first expression.



: Segment override. Instructs the assembler that the expression after the colon belongs to the segment given by

the segment register name (CS, DS, SS, FS, GS, or ES) before the colon. The result is a memory reference with

the value of the expression after the colon. When a segment override is used in an instruction operand, the

instruction is prefixed with an appropriate segment-override prefix instruction to ensure that the indicated segment

is selected.

OFFSET Returns the offset part (double word) of the expression following the operator. The result is an immediate value.



TYPE Returns the type (size in bytes) of the expression following the operator. The type of an immediate value is 0.

PTR Typecast operator. The result is a memory reference with the value of the expression following the operator and

the type of the expression in front of the operator.



* Multiplication. Both expressions must be absolute immediate values, and the result is an absolute immediate

value.



/ Integer division. Both expressions must be absolute immediate values, and the result is an absolute immediate

value.

MOD Remainder after integer division. Both expressions must be absolute immediate values, and the result is an

absolute immediate value.

SHL Logical shift left. Both expressions must be absolute immediate values, and the result is an absolute immediate

value.

SHR Logical shift right. Both expressions must be absolute immediate values, and the result is an absolute immediate

value.

NOT Bitwise negation. The expression must be an absolute immediate value, and the result is an absolute immediate

value.

AND Bitwise AND. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

OR Bitwise OR. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

XOR Bitwise exclusive OR. Both expressions must be absolute immediate values, and the result is an absolute

immediate value.
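
Since most of these operators only accept absolute immediate values, they amount to compile-time integer arithmetic. A Python sketch of what HIGH, LOW, SHL and SHR compute on such constants, assuming 32-bit wraparound for the shifts (the function names are invented, and the wraparound is an assumption based on expressions being evaluated as 32-bit integers):

```python
MASK32 = 0xFFFFFFFF  # expressions are evaluated as 32-bit integer values


def high(word: int) -> int:
    """HIGH: high-order 8 bits of a word-sized value."""
    return (word >> 8) & 0xFF


def low(word: int) -> int:
    """LOW: low-order 8 bits of a word-sized value."""
    return word & 0xFF


def shl(value: int, count: int) -> int:
    """SHL: logical shift left, kept to 32 bits (assumed)."""
    return (value << count) & MASK32


def shr(value: int, count: int) -> int:
    """SHR: logical shift right on the 32-bit pattern."""
    return (value & MASK32) >> count


if __name__ == "__main__":
    print(hex(high(0x1234)), hex(low(0x1234)))
    print(hex(shl(1, 4)), hex(shr(0x80, 4)))
```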



Assembly Procedures and Functions (Win32 Only)



You can write complete procedures and functions using inline assembly language code, without including a

begin...end statement. This topic covers these issues:

Compiler Optimizations.

Function Results.

The inline assembler is available only on the Win32 Delphi compiler.



Compiler Optimizations



An example of the type of function you can write is as follows:



function LongMul(X, Y: Integer): Longint;

asm

MOV EAX,X

IMUL Y

end;



The compiler performs several optimizations on these routines:

No code is generated to copy value parameters into local variables. This affects all string-type value parameters

and other value parameters whose size isn't 1, 2, or 4 bytes. Within the routine, such parameters must be treated

as if they were var parameters.

Unless a function returns a string, variant, or interface reference, the compiler doesn't allocate a function result

variable; a reference to the @Result symbol is an error. For strings, variants, and interfaces, the caller always

allocates an @Result pointer.

The compiler only generates stack frames for nested routines, for routines that have local parameters, or for

routines that have parameters on the stack.

Locals is the size of the local variables and Params is the size of the parameters. If both Locals and Params

are zero, there is no entry code, and the exit code consists simply of a RET instruction.

The automatically generated entry and exit code for the routine looks like this:



PUSH EBP          ;Present if Locals <> 0 or Params <> 0
MOV EBP,ESP       ;Present if Locals <> 0 or Params <> 0
SUB ESP,Locals    ;Present if Locals <> 0
...
MOV ESP,EBP       ;Present if Locals <> 0
POP EBP           ;Present if Locals <> 0 or Params <> 0
RET Params        ;Always present



If locals include variants, long strings, or interfaces, they are initialized to zero but not finalized.
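
The "Present if" conditions can be turned into a small generator to see which instructions appear for a given routine. This is a Python sketch of the rules as quoted, not of what any particular compiler version really emits; frame_code is a made-up name.

```python
def frame_code(locals_size: int, params_size: int):
    """Return (entry, exit) instruction lists for the given sizes in bytes."""
    need_frame = locals_size != 0 or params_size != 0

    entry = []
    if need_frame:
        entry += ["PUSH EBP", "MOV EBP,ESP"]
    if locals_size != 0:
        entry.append(f"SUB ESP,{locals_size}")

    exit_code = []
    if locals_size != 0:
        exit_code.append("MOV ESP,EBP")
    if need_frame:
        exit_code.append("POP EBP")
    # RET is always present; with no parameters it is a plain RET.
    exit_code.append(f"RET {params_size}" if params_size else "RET")
    return entry, exit_code


if __name__ == "__main__":
    for sizes in ((0, 0), (8, 0), (0, 8)):
        print(sizes, frame_code(*sizes))
```

With both sizes zero this produces no entry code and a lone RET, matching the description above.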



Function Results



Assembly language functions return their results as follows.

Ordinal values are returned in AL (8-bit values), AX (16-bit values), or EAX (32-bit values).

Real values are returned in ST(0) on the coprocessor's register stack. (Currency values are scaled by 10000.)



Pointers, including long strings, are returned in EAX.

Short strings and variants are returned in the temporary location pointed to by @Result.

"



Bye,

Skybuck.





  #19  
Old August 1st 11, 10:52 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Damn acrobat sucks really bad at copy&pasting.

It somehow didn't copy & paste the most important part, so I'll try again
here:

"
Using Registers

In general, the rules of register use in an asm statement are the same as
those of an external procedure or function.

An asm statement must preserve the EDI, ESI, ESP, EBP, and EBX registers,
but can freely modify the EAX, ECX,

and EDX registers. On entry to an asm statement, EBP points to the current
stack frame and ESP points to the top

of the stack. Except for ESP and EBP, an asm statement can assume nothing
about register contents on entry to

the statement.
"

Bye,
Skybuck.

  #20  
Old August 1st 11, 11:19 PM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]

Also, in case you have trouble understanding the Pascal code, I shall
try to write down some simple and easy-to-understand pseudo code and some
explanations/comments, let's see:

What the code does and assumes is the following:

General description (this information is not that important, but it might give you
the general idea):

"Memory" is a pointer to a big linear/sequential memory block of
bytes/integers. The memory block contains many integers. These integers are
"virtually" subdivided into separate blocks.

Each block contains 8000 integers. This is indicated by ElementCount, so
ElementCount equals 8000.

There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals
4000.

So in total there are 4000 blocks * 8000 elements * 4 bytes = 128,000,000 bytes of
memory.

Each block is like a chain of indexes.

Each index points towards the next index within its block.

The indexes are thrown around inside the block during initialisation to
create a random access pattern.

This is also known as the "pointer chasing problem" so it could also be
called the "index chasing problem".

Index - Index - Index - Index - Index and so forth.

The first index must be retrieved from memory, only then does one know where
the next index is located and so on.

There is an ElementIndex variable which is initialized to zero.

So each ElementIndex simply starts at element 0 of block 0.

There could be 8000 ElementIndexes in parallel all starting at element 0 of
block X.

Or there could simply be one ElementIndex that processes each block in turn.

In version 0.03 "BlockBase" was introduced.

BlockBase is an index which points towards the first element of a block.

So it's storage for a precomputed value, to avoid doing the multiplication all the time.

In the code below 3 blocks in parallel are attempted.

Now for some further pseudo code:

// routine

// variables:

// each block is processed repeatedly loop count times.
// loop index indicates the current loop/round
LoopIndex

BlockIndexA // indicates block a number
BlockIndexB // indicates block b number
BlockIndexC // indicates block c number

ElementIndexA // indicates block a element index
ElementIndexB // indicates block b element index
ElementIndexC // indicates block c element index

ElementCount // number of elements per block
BlockCount // number of blocks
LoopCount // number of loops/chases per block

BlockBaseA // starting index of block a
BlockBaseB // starting index of block b
BlockBaseC // starting index of block c

Memory // contains all integers of all blocks of all elements. So it's an
array/block of elements/integers.

RoutineBegin

ElementCount = 8000
BlockCount = 4000
LoopCount = 80000

BlockIndexA = 0
BlockIndexB = 1
BlockIndexC = 2

// loop over all blocks, process 3 blocks per loop iteration.
// "BlockIndex goes from 0 to 3999 divided over 3 indexes"
// so BlockIndexA is 0, 3, 6, 9, 12, etc
// so BlockIndexB is 1, 4, 7, 10, 13, etc
// so BlockIndexC is 2, 5, 8, 11, 14, etc
FirstLoopBegin

// calculate the starting index of each block.
// formula is: Base/Index/Offset = block number * number of elements per block.
BlockBaseA = BlockIndexA * ElementCount
BlockBaseB = BlockIndexB * ElementCount
BlockBaseC = BlockIndexC * ElementCount

// initialise each element index to the first index of the block.
ElementIndexA = 0
ElementIndexB = 0
ElementIndexC = 0

// loop 80000 times through the block arrays/elements
SecondLoopBegin "LoopIndex goes from 0 to 79999"

// Seek into memory at location BlockBase + ElementIndex
// do this for A, B and C:
// retrieve the new element index via the old element index
ElementIndexA = Memory[ BlockBaseA + ElementIndexA ]
ElementIndexB = Memory[ BlockBaseB + ElementIndexB ]
ElementIndexC = Memory[ BlockBaseC + ElementIndexC ]

SecondLoopEnd

// store some bull**** in block result, this step could be skipped,
// but let's not, so the compiler doesn't optimize anything away.
// perhaps block result should be printed just in case.
// the last retrieved index is stored into it.
// do this for A, B, C

BlockResult[ BlockIndexA ] = ElementIndexA
BlockResult[ BlockIndexB ] = ElementIndexB
BlockResult[ BlockIndexC ] = ElementIndexC

// once the 3 blocks have been processed go to the next 3 blocks
BlockIndexA = BlockIndexA + 3
BlockIndexB = BlockIndexB + 3
BlockIndexC = BlockIndexC + 3

FirstLoopEnd
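
For anyone who wants to play with the idea without Delphi, the pseudo code above can be sketched in Python. This is not the Delphi test program; it only mirrors the loop structure. The initialisation (one random cycle per block) and the skipping of out-of-range blocks when BlockCount is not a multiple of 3 are assumptions added here, and the sizes in the demo are tiny so it runs quickly.

```python
import random


def make_memory(block_count: int, element_count: int, seed: int = 0) -> list:
    """Build the index chains: each block is one random cycle over its elements."""
    rng = random.Random(seed)
    memory = []
    for _ in range(block_count):
        order = list(range(element_count))
        rng.shuffle(order)
        block = [0] * element_count
        # Each visited element stores the index of the next element in the cycle.
        for pos, idx in enumerate(order):
            block[idx] = order[(pos + 1) % element_count]
        memory.extend(block)
    return memory


def chase(memory: list, block_count: int, element_count: int, loop_count: int) -> list:
    """Chase indexes through three blocks at a time, as in the pseudo code."""
    block_result = [0] * block_count
    for block_index_a in range(0, block_count, 3):
        # Blocks A, B, C; drop any that fall past the last block (assumption).
        blocks = [b for b in range(block_index_a, block_index_a + 3) if b < block_count]
        bases = [b * element_count for b in blocks]
        element_index = [0] * len(blocks)
        for _ in range(loop_count):
            for k in range(len(blocks)):
                element_index[k] = memory[bases[k] + element_index[k]]
        for k, b in enumerate(blocks):
            block_result[b] = element_index[k]
    return block_result


if __name__ == "__main__":
    mem = make_memory(block_count=4, element_count=5)
    print(chase(mem, block_count=4, element_count=5, loop_count=5))
    print(4000 * 8000 * 4, "bytes for the full-size test")
```

Each load depends on the result of the previous load in the same block, which is what makes the access pattern latency bound; running three independent chains per iteration gives the CPU three loads it can overlap.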

Perhaps this clarifies it a little bit.

Let me know if you have any further questions

Bye,
Skybuck.

 



