August 12th 11, 10:37 PM, posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

wolfgang kern wrote:


Bernhard Schornak wrote:
...
L00:
mov ecx,[esp] # ECX = 0x00[ESP]
mov ebx,esi # EBX = ESI
shl ebx,cl # EBX = EBX << CL
xor edx,edx # EDX = 0
mov ecx,[esp+$04] # ECX = 0x04[ESP]
dec ecx # ECX - 1
test ecx,ecx # redundant
jb L02 # outer loop if sign
inc ecx # ECX + 1


DEC won't alter carry, so "jb" aka "jc" should be replaced by "jng" or "js".



Hi! Thanks for refreshing my knowledge base ... I
haven't seen much more than the steering wheel of
an Atego 1222 and thousands of waybills for about
nine months, now.


...
@ Wolfgang: Both loops do work properly. In the worst
case (value is zero), these loops count down the full
32 bit range.


OTOH what I see is:

dec ecx
jng ...

actually checks whether ecx was zero or negative before the DEC,
so I'd have just

test ecx,ecx
jng ... ;jumps if ZF=1 or SF<>OF (zero or negative, signed)

as this will imply a zero detection.
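
A small sketch comparing the two forms, written as a Python model rather than assembly. The helper names are hypothetical; the model assumes the documented JNG condition (ZF=1 or SF<>OF), that DEC sets OF only when decrementing 0x80000000, and that TEST clears OF.

```python
MASK = 0xFFFFFFFF
SIGN = 0x80000000

def jng_after_dec(ecx):
    """dec ecx / jng: evaluate ZF, SF, OF of the decrement, then the JNG condition."""
    ecx &= MASK
    result = (ecx - 1) & MASK
    zf = result == 0
    sf = bool(result & SIGN)
    of = ecx == SIGN          # only 0x80000000 - 1 overflows as a signed value
    return zf or (sf != of)

def jng_after_test(ecx):
    """test ecx,ecx / jng: OF is cleared, so the condition is ZF or SF."""
    ecx &= MASK
    zf = ecx == 0
    sf = bool(ecx & SIGN)
    return zf or sf

def disagreements(samples):
    """Return the sample values for which the two sequences branch differently."""
    return [v for v in samples if jng_after_dec(v) != jng_after_test(v)]

if __name__ == "__main__":
    print(disagreements([0, 1, 2, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]))
```

Under these assumptions the two sequences agree for zero and for negative values; they differ at ECX = 1, where dec/jng branches and test/jng does not. Which behaviour is wanted depends on the loop's intended lower bound.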



Right. Hence, the dec/inc pairs are redundant for
checking the range of EDX and ECX. Freeing EBP as
a GPR allows replacing the outer loop counter
[ESP+0x08] with a register. Just these three
cosmetic changes saved 5 * 4,000 = 20,000 clocks...

...not a real improvement for processing a 130 MB
array, randomly accessed 320,000,000 times...

My suggestion to expand the 8,000 to 8,192 dwords
could reduce all range checks to

and ecx,0x1FFF
je ...

This leaves a valid index in ECX and skips processing
if ECX = 0. The same works for EDX (ANDed with 0x0FFF).
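
As a rough illustration of the masking idea, sketched in Python: with the block padded to a power of two, a single AND both clamps the index and produces the zero test. The function name and the return convention are invented for this sketch; only the mask values 0x1FFF and 0x0FFF come from the suggestion above.

```python
ECX_MASK = 0x1FFF  # 8,192 dwords - 1
EDX_MASK = 0x0FFF  # 4,096 entries - 1

def masked_index(raw, mask):
    """Model 'and reg,mask / je skip'.

    Returns (index, skip): index is always within 0..mask,
    skip is True when the AND result is zero (the JE would be taken).
    """
    index = raw & mask
    return index, index == 0

if __name__ == "__main__":
    for raw in (0, 1, 8000, 8191, 8192, 0xFFFFFFFF):
        print(raw, masked_index(raw, ECX_MASK))
```

Note that the mask wraps out-of-range values instead of rejecting them, so this only substitutes for the range checks if every index from 0 to 8,191 is safe to touch, which is the reason for padding the 8,000 dwords up to 8,192.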


And for a biased range, i.e.:

cmp ecx,3
jng ... ;jumps if ecx = 3 or less (signed)



A sometimes required operation. In most cases, it
is better to define a valid range and "transpose"
it to something counted up or down to zero, using
appropriate offsets "compensating" the transposed
index.
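
One way to picture the "transposed" range, as a minimal Python sketch: the counter runs from a negative count up to zero, and the end of the range is used as the compensating offset. The function name, the summing payload and the bounds are illustrative assumptions, not taken from the code discussed in this thread.

```python
def sum_range_counting_up_to_zero(data, lo, hi):
    """Sum data[lo:hi] using a counter that runs from -(hi - lo) up to 0.

    'end' plays the role of the compensating offset: end + counter walks
    lo, lo + 1, ..., hi - 1 while the loop itself only tests for zero.
    """
    if lo >= hi:                 # empty or inverted range: nothing to do
        return 0
    end = hi                     # offset compensating the transposed index
    counter = lo - hi            # negative element count
    total = 0
    while counter != 0:          # in assembly, the INC already sets ZF
        total += data[end + counter]
        counter += 1
    return total

if __name__ == "__main__":
    values = list(range(10))
    print(sum_range_counting_up_to_zero(values, 3, 7))
```

The point of the transposition is that the loop condition becomes a plain zero test on the counter, so no separate compare against the upper limit is needed inside the loop.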

Unfortunately, there seem to be some addresses in
the first elements of each block, so a properly
coded loop has to check for the lower limit as
well (the real array starts at offset 0x18). That
slows down the code with two additional branches.
Looks like HeLL, smells like HeLL, just the design
isn't all that "hell" (German: bright)...


Greetings from Augsburg

Bernhard Schornak