#21
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
wolfgang kern wrote:
Bernhard Schornak wrote:
...
L00:
    mov ecx,[esp]       # ECX = 0x00[ESP]
    mov ebx,esi         # EBX = ESI
    shl ebx,cl          # EBX = EBX << CL
    xor edx,edx         # EDX = 0
    mov ecx,[esp+$04]   # ECX = 0x04[ESP]
    dec ecx             # ECX - 1
    test ecx,ecx        # redundant
    jb L02              # outer loop if sign
    inc ecx             # ECX + 1

DEC won't alter carry, so "jb" aka "jc" should be replaced by "jng" or "js".

Hi!

Thanks for refreshing my knowledge base ... I haven't seen much more than the steering wheel of an Atego 1222 and thousands of waybills for about nine months, now.
...
@ Wolfgang: Both loops do work properly. In the worst case (value is zero), these loops count down the full 32 bit range.

OTOH what I see is:

    dec ecx
    jng ...

actually checks if ecx were zero or negative before the DEC, so I'd have just

    test ecx,ecx
    jng ...             ;jumps on zero- or sign- or overflow-flag

as this will imply a zero detection.

Right. Hence, the dec/inc pairs are redundant for checking the range of EDX and ECX. Freeing EBP as a GPR allows replacing the outer loop counter [ESP + 0x08] with a register. Just these three cosmetic changes saved 5 * 4,000 = 20,000 clocks...

....not a real improvement for processing a 130 MB array, randomly accessed 320,000,000 times...

My suggestion to expand the 8,000 to 8,192 dwords could reduce all range checks to

    and ecx,0x1FFF
    je ...

which leaves a valid index in ECX, and skips processing if ECX = 0. Same with EDX (anded with 0x0FFF).

And for a biased range, ie:

    cmp ecx,3
    jng ...             ;jumps if ecx = 3 or less (signed)

A sometimes required operation. In most cases, it is better to define a valid range and "transpose" it to something counted up or down to zero, using appropriate offsets "compensating" the transposed index.

Unfortunately, there seem to be some addresses in the first elements of each block, so the properly coded loop had to check for the lower limit - the real array starts at offset 0x18 - as well. Slows down the code with two additional branches.

Looks like HeLL, smells like HeLL, nua ös Design is ned goa so hell...

Greetings from Augsburg
Bernhard Schornak
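The two ideas in the post above (masking the index instead of range-checking it, and "transposing" a range so a single counter runs down to zero) can be sketched in C roughly as follows. This is only an illustration under assumptions: the block is taken to be padded to 8,192 dwords as suggested, and the names `BLOCK_DWORDS`, `masked_index` and `sum_range_countdown` are made up for this sketch; they do not appear in the original program.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumption: each block is padded from 8,000 to 8,192 dwords. */
#define BLOCK_DWORDS 8192u
#define INDEX_MASK   (BLOCK_DWORDS - 1u)   /* 0x1FFF */

/* Mirrors "and ecx,0x1FFF / je ...": the mask forces the value into
   0..8191, and a result of zero means "skip processing".
   Returns 1 and stores the index if it is nonzero, otherwise 0. */
static int masked_index(uint32_t raw, uint32_t *index_out)
{
    uint32_t idx = raw & INDEX_MASK;
    if (idx == 0u)
        return 0;
    *index_out = idx;
    return 1;
}

/* "Transposed" loop: sums data[first..last] (inclusive, first <= last)
   with one counter that runs down to zero. The base pointer is moved
   one past the last element so the counter can be used as a negative
   offset, which "compensates" for the transposed index. */
static uint32_t sum_range_countdown(const uint32_t *data,
                                    size_t first, size_t last)
{
    const uint32_t *end = data + last + 1;   /* one past data[last] */
    uint32_t total = 0;
    for (size_t n = last - first + 1; n != 0; --n)
        total += end[-(ptrdiff_t)n];         /* visits first, ..., last */
    return total;
}
```

With the mask in place the only branch left is the zero test, and in the countdown loop the loop condition itself is the zero test, so no separate compare against an upper limit is needed.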
#22
Ok,
I am a bit surprised I didn't respond to this posting of yours yet, but that might be because most of it is true and I have nothing further to add, except of course confirming that it was indeed somewhat of a misunderstanding. But it was also wishful thinking. I was hoping that [mem + 0x04] somehow meant that it was using "mem" as a base address and that "0x04" would contain another pointer which would point to the random memory location. I am not much of an x86 programmer and each assembler probably has its own pointer syntax/whatever. I am glad you cleared this up.

I also don't hold your example against you, because at the time of writing I had probably not yet explained what the program was doing. Only after your posting did I explain further... So I think it's good of me to clarify that further, just to be clear that you were not trying to sabotage this thread with a wrong example. At the time you probably simply didn't understand what I was trying to do... or maybe you did understand but provided a sequential example anyway. Only you know the truth of what was in your head at the time, and your true intention....

Though a sequential example which more or less assumes that data is inside the cache could still be interesting. So there could still be some value in this... perhaps my code is already doing something like this... but I guess not, because it's not really giving any higher performance, but that might be because the data set is too large. If the elements were reduced from 8000 to 800 then 3 pairs of blocks might fit in the cache and then maybe your code example might still have some value. However not really, since the single program is probably already maxing out the system/cpu. It's not the cache or memory which is holding back the program, it's simply the amount of instructions. SSE is not going to help since there is no SSE instruction which can retrieve multiple random memory access locations. So too bad for that.
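On the [mem + 0x04] question from the start of this post, the difference between the two readings can be sketched in C. This is a minimal illustration, not code from the thread: `load_at_displacement` and `load_indirect` are hypothetical names, and the second function shows the "wishful" reading, which on x86 takes two separate loads rather than one addressing mode.

```c
#include <stdint.h>
#include <string.h>

/* What "mov ecx,[mem+0x04]" does: a single 32-bit load from the
   address that lies 4 bytes past mem. Nothing stored at that
   location is followed any further. */
static uint32_t load_at_displacement(const unsigned char *mem)
{
    uint32_t value;
    memcpy(&value, mem + 4, sizeof value);   /* alignment-safe load */
    return value;
}

/* The other reading: the value found at mem+4 is itself used as an
   element index into a second array. This needs a second, dependent
   load, and that dependent load is where a random access would stall. */
static uint32_t load_indirect(const unsigned char *mem,
                              const uint32_t *table)
{
    uint32_t index = load_at_displacement(mem);   /* first load  */
    return table[index];                          /* second load */
}
```

The second load cannot start before the first one has delivered its result, which is the reason random lookups of this kind are bound by memory latency once the data no longer fits in the cache.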
If you truly want to write a faster program in assembler, and especially a significantly faster program, then you would either:

1. Have to reduce the number of instructions further, which will probably be hard to do (inside the inner loop).

or

2. Find a way to "fetch" multiple elements per cycle (per instruction).

My take on it is: 1 might be possible, but 2 probably not.

Also:

3. Writing a 64 bit program is probably useless, since 64 bit instructions execute twice as slow, at least on my processor; perhaps you should test this on your Phenom and see what happens.

However, under different circumstances perhaps you can do better, like not-fitting-in-cache circumstances. But this again would probably require some kind of "anti-stalling", "anti-waiting" code, which was what my original posting was more or less about... letting code proceed as much as possible and jumping back when memory results are in... something in that trend...

Perhaps you are now starting to see that my posting is about something new or extra which might not be possible with current hardware, though you do keep insisting that it is possible. I believe you 50%, a little bit... even if it's possible it will be little; you still have to convince me for 100% though... my lack of time has still prevented me from running your program. Perhaps I don't want to know the results for now, since it wouldn't really be that helpful, I guess.

However, if you could somehow re-write your program from "optimized assembler" back to "free pascal code" in such a way that the free pascal compiler produces more or less the same code, then that would be highly interesting! Especially if the code is indeed faster for certain circumstances! It would probably still not be interesting for my current project, but very maybe for future projects.

However, I see a big mistake in your reasoning of optimizing your program, which is at the same time a big mistake in my reasoning and original post.
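The closest thing current CPUs offer to the "anti-waiting" idea mentioned above is probably software prefetching: ask for a later element early, then keep working on the current one while the request is in flight. The sketch below is an assumption-laden illustration, not the program discussed in this thread. It relies on `__builtin_prefetch`, which is a GCC/Clang builtin, and the distance of 8 elements is a guess that would need tuning on real hardware; whether it helps at all depends on the access pattern and the machine.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumption: how many elements ahead to request. Needs tuning. */
#define PREFETCH_DISTANCE 8

/* Sums data[indices[i]] for i = 0..count-1. While element i is being
   added, a prefetch hint is issued for the element that will be needed
   PREFETCH_DISTANCE iterations later, so that its cache line may
   already be on the way when the loop gets there. */
static uint64_t sum_random_prefetched(const uint32_t *data,
                                      const uint32_t *indices,
                                      size_t count)
{
    uint64_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        if (i + PREFETCH_DISTANCE < count) {
            /* GCC/Clang builtin: read access (0), low temporal locality (0).
               It is only a hint and never changes the computed result. */
            __builtin_prefetch(&data[indices[i + PREFETCH_DISTANCE]], 0, 0);
        }
        total += data[indices[i]];
    }
    return total;
}
```

This only works here because the index list is known in advance. If every index depended on the previously loaded value there would be nothing to prefetch, and the code would be back to waiting on each access.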
"CUDA/GPUs" can probably already do what I propose, to a certain extent at least... and they probably do it much better than a CPU... which means x86/intel/amd potentially has a huge problem, since GPUs can do something which their older processors apparently cannot, which is: process very large data sets in a random fashion.

However, I am unsure about ATI cards and the newer AMD gpu's; even INTEL seems to have "gpu's embedded"... I find intel kinda secretive about that... they are not coming forward with details... they probably did the same during the 80486 age... they keep "secrets" "secret" to give their selected programmers an advantage over others... not a good practice for us it seems, which makes me less enthusiastic to buy their processors.

nvidia's gpu's seem better documented, especially cuda... but I am thinking this is necessary to attract more programmers to it... cuda has little benefit for now... but maybe that will change in the future... nvidia is in business with big companies: apple and microsoft and probably also google and maybe even more... they are positioned pretty well... I do wonder what will happen to their documentation if they do make a big x86 killer... perhaps they will become more secretive again, which would suck.

At least that's how I experience it a little bit... perhaps the things I wrote about intel might not be entirely true... but that is how I feel about the latest/greatest... or maybe their documentation is just simply lagging behind a little bit... or I didn't look... or it doesn't concern me yet since I don't have those chips or proper simulators.

I have been programming x86 for at least 18 years as well or so... and I still don't have a proper x86 simulator, which kinda sucks! Having one which could show the "amount of cycles" taken by programs would be great. Quite strange that it apparently doesn't exist, or is too ****ty/complex/I can't understand it or whatever, or takes too long to execute?! Hmm...
This is what I do like about "virtual instruction sets"... total insight into how everything executes and such! This probably explains why there are so many virtual instruction sets: javascript, java, php, .net, flash, other scripting languages.

^ Sad thing is these all have security issues, so that doesn't really help at protecting computers; it takes only one hole to screw up a computer.

But it does make computers run slower.

Bye,
Skybuck.
#23
"
This probably explains why there are so many virtual instruction sets: javascript, java, php, .net, flash, other scripting languages.

^ Sad thing is these all have security issues, so that doesn't really help at protecting computers; it takes only one hole to screw up a computer.

But it does make computers run slower.
"

Oh, I forgot one, which could be a pretty serious one in the future.

How could I forget my favorite one at the moment: "PTX"

And perhaps I should also mention "redcode", but it's not really that serious, more meant for fun.

But PTX is of course serious.

And there is also AMD's version, "vm" or something...

Which suddenly makes me wonder what intel's instruction set is for their integrated cpu+gpu thingy... hmmm

Bye,
Skybuck.