An idea how to speed up computer programs and avoid waiting. ("event driven memory system")



 
 
#21
August 12th 11, 10:37 PM, posted to alt.comp.lang.borland-delphi, alt.comp.periphs.videocards.nvidia, alt.lang.asm, comp.arch, rec.games.corewar
Bernhard Schornak

wolfgang kern wrote:


Bernhard Schornak wrote:
...
L00:
mov ecx,[esp] # ECX = 0x00[ESP]
mov ebx,esi # EBX = ESI
shl ebx,cl # EBX = EBX << CL
xor edx,edx # EDX = 0
mov ecx,[esp+$04] # ECX = 0x04[ESP]
dec ecx # ECX - 1
test ecx,ecx # redundant
jb L02 # outer loop if sign
inc ecx # ECX + 1


DEC won't alter carry, so "jb" aka "jc" should be replaced by "jng" or "js".



Hi! Thanks for refreshing my knowledge base ... I
haven't seen much more than the steering wheel of
an Atego 1222 and thousands of waybills for about
nine months, now.


...
@ Wolfgang: Both loops do work properly. In the worst
case (value is zero), these loops count down the full
32 bit range.


OTOH what I see is:

dec ecx
jng ...

actually checks whether ECX was zero or negative before the DEC,
so I'd use just

test ecx,ecx
jng ... ; jumps on the zero, sign, or overflow flag

as this will imply a zero detection.



Right. Hence, the dec/inc pairs are redundant for
checking the range of EDX and ECX. Freeing EBP as
a GPR makes it possible to replace the outer loop
counter [ESP+0x08] with a register. Just these
three cosmetic changes saved 5 * 4,000 = 20,000 clocks...

....not a real improvement for processing a 130 MB
array, randomly accessed 320,000,000 times...
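
A sketch of what that last change might look like (keeping the counter in EBP and the label are assumptions following the text, not the actual code):

mov ebp,[esp+$08] # load the outer loop count once into a register
L02:
... # inner loop body
dec ebp # count the outer loop down in EBP
jnz L02 # no stack access for the counter any more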

My suggestion to expand the 8,000 elements to 8,192
dwords could reduce all range checks to

and ecx,0x1FFF
je ...

which leaves a valid index in ECX and skips processing
if ECX = 0. The same works for EDX (ANDed with 0x0FFF).
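
A rough sketch of how both checks might look inside the loop (the label and the use of ESI as the block's base address are assumptions, not taken from the original program):

and ecx,0x1FFF # ECX = index modulo 8,192 - always a valid dword index
jz next_block # index 0 means "nothing to do", skip this block
and edx,0x0FFF # EDX = index modulo 4,096
jz next_block
mov eax,[esi+ecx*4] # the masked index directly addresses a dword in the block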


And for a biased range, i.e.:

cmp ecx,3
jng ... ;jumps if ecx = 3 or less (signed)



A sometimes required operation. In most cases, though,
it is better to define a valid range and "transpose"
it into something counted up or down to zero, using
appropriate offsets to "compensate" for the transposed
index.
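
A minimal sketch of such a transposition, assuming ECX initially holds the element count and ESI points to the start of the array (both assumptions, not taken from the original program):

lea esi,[esi+ecx*4] # bias the base address by the element count
neg ecx # the index now runs from -count up to zero
L01:
mov eax,[esi+ecx*4] # the negative index still addresses the same elements
... # process EAX
inc ecx # count towards zero
jnz L01 # no cmp needed - the loop ends when the index reaches zero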

Unfortunately, there seem to be some addresses stored
in the first elements of each block, so a properly
coded loop has to check for the lower limit - the
real array starts at offset 0x18 - as well. That slows
down the code with two additional branches. Looks
like HeLL, smells like HeLL, only the design is not
quite so bright...


Greetings from Augsburg

Bernhard Schornak
#22
August 15th 11, 03:11 AM, posted to alt.comp.lang.borland-delphi, alt.comp.periphs.videocards.nvidia, alt.lang.asm, comp.arch, rec.games.corewar
Skybuck Flying

Ok,

I am a bit surprised I didn't respond to this posting of yours yet, but that
might be because most of it is true and I have nothing further to add,
except of course confirming that it was indeed something of a misunderstanding.

But it was also wishful thinking.

I was hoping that

[mem + 0x04] somehow meant that it was using "mem" as a base address and
that the dword at offset 0x04 would contain another pointer, which would point
to the random memory location.

I am not much of an x86 programmer, and each assembler probably has its own
pointer syntax or whatever.
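
For reference, the difference looks roughly like this (ESI and EBX are just example registers, not taken from the actual program):

mov eax,[esi+0x04] # one load: base register ESI plus constant offset 4
mov eax,[esi+ebx*4] # one load: base plus scaled index register - the "random element" case
mov ebx,[esi+0x04] # following a pointer stored at ESI+4 takes two loads:
mov eax,[ebx] # first fetch the pointer, then dereference it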

I am glad you cleared this up.

I also don't hold your example against you, because at the time of writing I
probably had not yet explained what the program was doing.

Only after your posting did I explain further...

So I think it's good of me to clarify that further, just to be clear that
you were not trying to sabotage this thread with a wrong example.

At the time you probably simply didn't understand what I was trying to do...
or maybe you did understand but provided a sequential example anyway.

Only you know the truth about what was in your head at the time and what your
true intention was... though a sequential example which more or less assumes
that the data is inside the cache could still be interesting.

So there could still be some value in this... perhaps my code is already
doing something like this... but I guess not, because it's not really giving
any higher performance, though that might be because the data set is too large.

If the elements were reduced from 8,000 to 800, then 3 pairs of blocks might
fit in the cache, and then maybe your code example might still have some
value.

However, not really, since the single program is probably already maxing out
the system/CPU.

It's not the cache or the memory which is holding back the program; it's simply
the number of instructions.

SSE is not going to help, since there is no SSE instruction which can
load from multiple random memory locations. So too bad for that.

If you truly want to write a faster program in assembler, and especially a
significantly faster program, then you would either:

1. Have to reduce the number of instructions further which will probably be
hard to do (inside the inner loop).

or

2. Find a way to "fetch" multiple elements per cycle (per instruction).

My take on it is: 1 might be possible, but 2 probably not.

Also:

3. Writing a 64-bit program is probably useless, since 64-bit instructions
execute twice as slowly, at least on my processor; perhaps you should test
this on your Phenom and see what happens.

However, under different circumstances perhaps you can do better, like
not-fitting-in-cache circumstances. But this again would probably require
some kind of "anti-stalling" / "anti-waiting" code.

Which was what my original posting was more or less about... letting code
proceed as much as possible and jumping back when the memory results are in...
something along those lines...
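
For what it's worth, part of that effect can already be had on current out-of-order CPUs by issuing several independent random loads before consuming any of them, so the cache misses overlap instead of queueing up one after another. A rough sketch (the registers are chosen for illustration only, not taken from the actual program):

mov eax,[esi+ebx*4] # load 1 - a cache miss may start here
mov ecx,[esi+edx*4] # load 2 - issued while load 1 is still in flight
mov ebp,[esi+edi*4] # load 3 - three misses now overlap
add eax,ecx # only now consume the results
add eax,ebp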

Perhaps you are now starting to see that my posting is about something new
or extra which might not be possible with current hardware. Though since you
do keep insisting that it is possible, I believe you a little bit, maybe 50%...
even if it's possible, the gain will be small; you still have to convince me
100%, though... my lack of time has still prevented me from running your program.

Perhaps I don't want to know the results for now, since they wouldn't really be
that helpful, I guess.

However, if you could somehow rewrite your program from "optimized
assembler" back to "Free Pascal code" in such a way that the Free Pascal
compiler produces more or less the same code, then that would be highly
interesting! Especially if the code is indeed faster under certain
circumstances! It would probably still not be interesting for my current
project, but maybe for future projects.

However, I see a big mistake in your reasoning about optimizing your program,
which is at the same time a big mistake in my reasoning and my original post.

"CUDA/GPU'S" can probably already do what I propose to a certain extent at
least... and they probably do it much better than CPU... which means
x86/intel/amd potentially has huge problem, since GPU can do something which
their older processors apperently cannot which is: process very large data
sets in a random fashion.

However, I am unsure about ATI cards and the newer AMD GPUs; even Intel seems
to have "embedded GPUs"... I find Intel kinda secretive about that... they
are not coming forward with details... they probably did the same during the
80486 age... they keep "secrets" "secret" to give their selected programmers
an advantage over others... not a good practice for us, it seems, which makes
me less enthusiastic about buying their processors.

Nvidia's GPUs seem better documented, especially CUDA... but I am thinking
this is necessary to attract more programmers to it... CUDA has little
benefit for now... but maybe that will change in the future... Nvidia is in
business with big companies: Apple and Microsoft, and probably also Google,
and maybe even more... they are positioned pretty well... I do wonder what will
happen to their documentation if they do make a big x86 killer... perhaps
they will become more secretive again, which would suck. At least that's how
I experience it a little bit... perhaps the things I wrote about Intel might
not be entirely true... but that is how I feel about the latest/greatest...
or maybe their documentation is just simply lagging behind a little bit...
or I didn't look... or it doesn't concern me yet since I don't have those
chips or proper simulators.

I have been programming x86 for at least 18 years as well, or so... and I
still don't have a proper x86 simulator, which kinda sucks!

Having one which could show the "amount of cycles" taken by programs would be
great. Quite strange that it apparently doesn't exist, or is too
****ty/complex, or I can't understand it or whatever, or it takes too long to
execute?! Hmm...

This is what I do like about "virtual instruction sets"... total insight
into how everything executes and such!

This probably explains why there are so many virtual instruction sets:

javascript, java, php, .net, flash, other scripting languages.

^ The sad thing is these all have security issues, so they don't really help at
protecting computers; it takes only one hole to screw up a computer.
But it does make computers run slower.

Bye,
Skybuck.

#23
August 15th 11, 03:14 AM, posted to alt.comp.lang.borland-delphi, alt.comp.periphs.videocards.nvidia, alt.lang.asm, comp.arch, rec.games.corewar
Skybuck Flying

"
This probably explains why there are so many virtual instruction sets:

javascript, java, php, .net, flash, other scripting languages.

^ The sad thing is these all have security issues, so they don't really help at
protecting computers; it takes only one hole to screw up a computer.
But it does make computers run slower.
"

Oh, I forgot one, which could be a pretty serious one in the future.

How could I forget my favorite one at the moment:

"PTX"

And perhaps I should also mention "redcode", but it's not really that serious,
more meant for fun.

But PTX is of course serious.

And there is also AMD's version "vm" or something...

Which suddenly makes me wonder what Intel's instruction set is for their
integrated CPU+GPU thingy... hmmm.

Bye,
Skybuck.

 



