  #22  
August 15th 11, 03:11 AM posted to alt.comp.lang.borland-delphi,alt.comp.periphs.videocards.nvidia,alt.lang.asm,comp.arch,rec.games.corewar
Skybuck Flying[_7_]
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")

Ok,

I am a bit surprised I didn't respond yet to this posting of yours, but that
might be because most of it is true and I have nothing further to add,
except of course confirming that it was indeed somewhat of a misunderstanding.

But it was also wishful thinking.

I was hoping that

[mem + 0x04] somehow meant that it was using "mem" as a base address and
that the location at offset "0x04" would contain another pointer, which would
point to the random memory location.
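
To make the difference concrete, here is a minimal C sketch (not code from
this thread). The function names and the use of a 32 bit index instead of a
raw pointer are assumptions made for illustration only. The first function is
what [mem + 0x04] actually does: one load from the address "mem plus 4 bytes".
The second is what I was hoping for: the value stored there is followed with
a second load, which cannot start until the first one has finished.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Base + displacement, i.e. what [mem + 0x04] means:
   a single load of the 32 bit value stored 4 bytes past mem. */
uint32_t load_displaced(const uint32_t *mem)
{
    assert(mem != NULL);
    return mem[1];              /* mem + 1 * sizeof(uint32_t) = mem + 0x04 */
}

/* Double indirection (hypothetical helper): the slot at mem + 0x04 holds
   an index into data, and a second, dependent load fetches that element. */
uint32_t load_indirect(const uint32_t *mem, const uint32_t *data)
{
    assert(mem != NULL && data != NULL);
    uint32_t next = mem[1];     /* first load: read the stored index */
    return data[next];          /* second load: address depends on the first */
}
```

With mem = {0, 3} and data = {10, 20, 30, 40}, load_displaced returns the
stored index 3, while load_indirect returns the element it refers to.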

I am not much of an x86 programmer and each assembler probably has its own
pointer syntax/whatever.

I am glad you cleared this up.

I also don't hold your example against you because at the time of writing I
had probably not yet explained what the program was doing.

Only after your posting did I explain further...

So I think it's good of me to clarify that further, just to be clear that
you were not trying to sabotage this thread with a wrong example.

At the time you probably simply didn't understand what I was trying to do...
or maybe you did understand but provided a sequential example anyway.

Only you know the truth of what was in your head at the time and your true
intention... though a sequential example which more or less assumes that
data is inside the cache could still be interesting.

So there could still be some value in this... perhaps my code is already
doing something like this... but I guess not because it's not really giving
any higher performance but that might be because the data set is too large.

If the elements were reduced from 8000 to 800 then 3 pairs of blocks might
fit in the cache and then maybe your code example might still have some
value.

However not really since the single program is probably already maxing out
the system/cpu.

It's not the cache or memory which is holding back the program, it's simply
the amount of instructions.

SSE is not going to help since there is no SSE instruction which can
retrieve data from multiple random memory locations. So too bad for that.
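
As a rough illustration of what such a "gather" would have to do, here is a
scalar C sketch. The function name and parameters are hypothetical and not
taken from any program in this thread. SSE can load four adjacent 32 bit
values with one instruction, but when the indices are random each element
still needs its own load, which is what the loop below does. (Later
instruction sets such as AVX2 did add gather instructions, but those are not
part of SSE.)

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar gather (hypothetical helper): out[i] = table[idx[i]].
   The addresses are unrelated, so each element is fetched by a separate
   load; there is no single SSE load that covers all of them. */
void gather_u32(uint32_t *out, const uint32_t *table,
                const uint32_t *idx, size_t count)
{
    assert(out != NULL && table != NULL && idx != NULL);
    for (size_t i = 0; i < count; ++i) {
        out[i] = table[idx[i]];   /* one load for idx[i], one for table[...] */
    }
}
```

The caller is responsible for making sure every idx[i] is a valid position
in table; the sketch does no bounds checking.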

If you truly want to write a faster program in assembler, and especially a
significantly faster program, then you would either:

1. Have to reduce the number of instructions further which will probably be
hard to do (inside the inner loop).

or

2. Find a way to "fetch" multiple elements per cycle (per instruction).

My take on it is: 1 might be possible, but 2 probably not.

Also:

3. Writing a 64 bit program is probably useless, since 64 bit instructions
execute twice as slowly, at least on my processor. Perhaps you should test
this on your Phenom and see what happens.

However under different circumstances perhaps you can do better, like
not-fitting-in-cache circumstances. But this again would probably require
some kind of "anti-stalling" or "anti-waiting" code.

Which was what my original posting was more or less about... letting code
proceed as much as possible and jumping back when memory results are in...
something in that trend...
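
The closest thing that can be written in plain code today is probably to keep
several independent memory lookups going at once. The C sketch below is only
an assumption of how that could look, with made-up names; it is not the
program discussed in this thread and nothing here has been measured. Each
array holds, for every element, the index of the next element to visit.
Following one chain forces every load to wait for the previous one. Following
two chains in the same loop gives the processor two loads per iteration that
do not depend on each other, so an out-of-order processor may overlap the
waiting. Whether that helps depends on the hardware and on whether the data
fits in the cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Follow one chain: every load needs the result of the previous load,
   so cache misses are taken one after another. */
uint32_t chase_one(const uint32_t *next, uint32_t start, size_t steps)
{
    assert(next != NULL);
    uint32_t pos = start;
    for (size_t i = 0; i < steps; ++i) {
        pos = next[pos];
    }
    return pos;
}

/* Follow two independent chains in the same loop (hypothetical helper).
   The two loads inside one iteration do not depend on each other. */
void chase_two(const uint32_t *next_a, const uint32_t *next_b,
               uint32_t start_a, uint32_t start_b, size_t steps,
               uint32_t *end_a, uint32_t *end_b)
{
    assert(next_a != NULL && next_b != NULL);
    assert(end_a != NULL && end_b != NULL);
    uint32_t pos_a = start_a;
    uint32_t pos_b = start_b;
    for (size_t i = 0; i < steps; ++i) {
        pos_a = next_a[pos_a];    /* chain A: independent of chain B */
        pos_b = next_b[pos_b];    /* chain B: independent of chain A */
    }
    *end_a = pos_a;
    *end_b = pos_b;
}
```

Both functions give the same end positions; the only difference is how much
independent work the processor has available while a load is outstanding.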

Perhaps you are now starting to see that my posting is about something new
or extra which might not be possible with current hardware. You do keep
insisting that it is possible, and I believe you about 50%... even if it's
possible the gain will be little. You still have to convince me 100%
though... my lack of time has still prevented me from running your program.

Perhaps I don't want to know the results for now since it wouldn't really be
that helpful I guess.

However if you could somehow re-write your program from "optimized
assembler" back to "free pascal code" in such a way that the free pascal
compiler produces more or less the same code then that would be highly
interesting! Especially if the code is indeed faster under certain
circumstances! It would probably still not be interesting for my current
project but very maybe for future projects.

However I see a big mistake in your reasoning of optimizing your program,
which is at the same time a big mistake in my reasoning and original post.

"CUDA/GPU'S" can probably already do what I propose to a certain extent at
least... and they probably do it much better than the CPU... which means
x86/Intel/AMD potentially has a huge problem, since a GPU can do something
which their older processors apparently cannot, which is: process very large
data sets in a random fashion.

However I am unsure about ATI cards and the newer AMD GPUs. Even Intel seems
to have "GPUs embedded"... I find Intel kinda secretive about that... they
are not coming forward with details... they probably did the same during the
80486 age... they keep "secrets" "secret" to give their selected programmers
an advantage over others... not a good practice for us it seems, which makes
me less enthusiastic to buy their processors.

nvidia's GPUs seem better documented, especially CUDA... but I am thinking
this is necessary to attract more programmers to it... CUDA has few
benefits for now... but maybe that will change in future... nvidia is in
business with big companies: Apple and Microsoft and probably also Google
and maybe even more... they are positioned pretty well... I do wonder what
will happen to their documentation if they do make a big x86 killer... perhaps
they will become more secretive again, which would suck. At least that's how
I experience it a little bit... perhaps the things I wrote about Intel might
not be entirely true... but that is how I feel about the latest/greatest...
or maybe their documentation is just simply lagging behind a little bit...
or I didn't look... or it doesn't concern me yet since I don't have those
chips or proper simulators.

I have been programming x86 for at least 18 years as well or so... and I
still don't have a proper x86 simulator, which kinda sucks!

Having one which could show the "amount of cycles" taken by programs would be
great. Quite strange that it apparently doesn't exist, or is too
****ty/complex/I can't understand it or whatever or takes too long to
execute?! Hmm...

This is what I do like about "virtual instruction sets"... total insight
into how everything executes and such !

This probably explains why there are so many virtual instruction sets:

javascript, java, php, .net, flash, other scripting languages.

^ Sad thing is these all have security issues, so they don't really help at
protecting computers; it takes only one hole to screw up a computer.
But it does make computers run slower.

Bye,
Skybuck.