August 3rd 11, 09:52 PM, posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Bernhard Schornak
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

Skybuck Flying wrote:


My initial guess at the generated assembler is:

32 bit mode simply does not have enough registers available to "pipeline" in parallel.

Perhaps 64 bit mode has enough registers available, however instructions in 64 bit mode
are twice as slow but maybe it might still be faster than 32 bit mode.



"
64 bit instructions have the same latencies as the
corresponding 32 bit instructions (except 64 bit MUL
and DIV). Using EBP as general purpose register (GP)
frees one register at no cost (except you -have to-
use MOV instead of PUSH and POP).
"

Not on my AMD X2 3800+ processor. It's a fake 64 bit processor



If it is an Athlon (the X2 probably means two cores?),
it should be 64 bit. Depending on family and stepping,
more (or less) XMM capabilities are provided.


I think it executes 64 bit instructions as two 32 bit instructions or worse

So the 64 bit instructions have clock cycles/latency of at least 2



If I look at the code you posted, it cannot run faster
than the 32 bit version.


I am not sure about newer processors but I would expect them to be faster but I do also
expect general suckage



My 64 bit Windows 7 is fully usable in less than 30 s,
regardless of whether it was powered down or rebooted.


So it could be interesting to turn the Delphi code into C/C++ code and then try on 64 bit
compiler.

However I am not really interested in C/C++ code because it's a hell of a lot of work to
convert all Delphi code to C/C++ code so not going to do that.


"
It doesn't really matter which high level language -
Pascal and C(++) compilers generate comparable code,
I guess. No automaton can replace a brain. Assembler
is the only choice for effective optimisation.
"

Mhoaw I am not so sure about that... when it comes to register re-use the compiler might
spot something more easily than a human being in all that code



A look at the posted code tells us the opposite...


I would expect a C/C++ compiler to be slightly faster, especially the one from Microsoft,
especially in "release mode".

Also Microsoft has a 64 bit compiler for a while now...



I use AS (part of the minGW-64 compiler suite).


But perhaps soon a 64 bit Delphi compiler will be out... I think there is already a
preview compiler somewhere.

This also leaves free pascal compiler as a possible try, which has 64 bit compiling as
well, last time I tried it it wasn't so great, but maybe it has improved but I wouldn't
hold my breath

Nonetheless, it's interesting to simply do a Free Pascal compile for 64 bit mode. It
shouldn't be that hard to do, I guess, so I am gonna give that a try and then see if I can
get at the assembler to see what it generates in 64 bit mode


"
Probably the same with fewer workarounds. 64 bit code
gains most of its speed due to parameter passing via
registers. Other optimisations and improvements were
possible, but: The 64 bit code I have seen until now
still looked like its older 32 bit brethren.
"

Yeah I have heard the same thing, the extra registers give it some more speed.

"
Having a short look at your 32 and 64 bit sources: I
have to translate them into something human readable
before I can start to figure out, what they actually
do. This might take a while (I am working from 06:00
until everything is done, leaving not much more than
one or two hours for anything else), but I'm sure it
is possible to make those loops (at least!) twice as
fast with some better code. I'll post some code by
Saturday or Sunday.
"

Well I am interested, but I doubt you can do it =D

But maybe I underestimate you ! =D



Maybe...


snip

"
Do you really need a result in seconds? Recent CPUs,
be it AMD or LETNi, change the frequency of a single
core if required. Busy cores run at higher frequency
while the frequency of idle cores is slowed down for
that time. I doubt the returned value is accurate in
this case. On my Phenom II 1100T, frequency can vary
between a few hundred and 3700 MHz (no overclocking)
per core, while the processor speed is 3300 MHz (and
this probably is the "frequency" reported by the API
function "QueryPerformanceFrequency").
"

I think it's accurate enough, the timing could be verified with other timers just in case.



Read the description in the MS Knowledge Base. My guess hit
the nail on the head.
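
To put numbers on the frequency question, here is a minimal sketch in C.
It only shows the arithmetic of turning a tick count into seconds. The
function name ticks_to_seconds is made up for illustration, no Windows
API is called, and it assumes (as I guessed above) that the ticks are
core clock cycles while the reported frequency is the nominal 3300 MHz.

```c
#include <stdint.h>

/* Convert a tick count to seconds, given a counter frequency in Hz.
   Hypothetical helper, not part of any API. */
static double ticks_to_seconds(uint64_t ticks, uint64_t freq_hz)
{
    return (double)ticks / (double)freq_hz;
}
```

With 3,300,000,000 ticks, dividing by the nominal 3,300,000,000 Hz gives
1.0 s. If the core really ran at 3700 MHz during the measurement, the
same ticks took only about 0.892 s. Comparing raw tick counts between two
code versions avoids that conversion altogether.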


One more question:

Which parameters do you pass to the function? How does
the code get the address of the memory block it should
process?


First improvement:

This sequence

movsxd rax,r14d
shl rax,2
mov dword ptr [rdx+rax],r9d

is equal to the single instruction

mov dword ptr [rdx+r14*4],r9d

This is repeated three times, adding six -superfluous-
clock cycles per iteration. 6 * 80,000 = 480,000 saved
cycles.

Do you need positive and negative indices? If not, six
more clocks per iteration can be saved. And this is no
optimization - it's just a correction of flaws...
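
For those who prefer C to assembler, a minimal sketch of the same idea.
The function names are invented for illustration, and the mapping of
parameters to rdx, r14 and r9d is only there to mirror the listing above.
The single instruction form relies on the upper half of r14 already
holding a valid 64 bit index, which is why the second function takes an
int64_t. The cycle figures are the estimate from above (six cycles over
three sequences, so two per sequence), not measurements.

```c
#include <stdint.h>

/* Three step form: movsxd rax,r14d / shl rax,2 / mov [rdx+rax],r9d */
static void store_three_step(int32_t *base, int32_t index, int32_t value)
{
    int64_t offset = (int64_t)index;  /* sign-extend the 32 bit index */
    offset *= 4;                      /* scale by sizeof(int32_t), like shl rax,2 */
    *(int32_t *)((char *)base + offset) = value;
}

/* Single instruction form: mov [rdx+r14*4],r9d
   The index must already be a valid 64 bit value. */
static void store_scaled(int32_t *base, int64_t index, int32_t value)
{
    base[index] = value;  /* the addressing mode does the scaling */
}

/* Superfluous cycles: sequences per iteration * cycles each * iterations. */
static uint64_t superfluous_cycles(uint64_t sequences,
                                   uint64_t cycles_each,
                                   uint64_t iterations)
{
    return sequences * cycles_each * iterations;
}
```

Both store functions write to the same address for any index that fits
in 32 bits, negative ones included. superfluous_cycles(3, 2, 80000)
gives the 480,000 cycles mentioned above.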


Greetings from Augsburg

Bernhard Schornak