#1
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Here is an idea:
1. The computer program issues load and store requests to the memory system. Instead of specifying only the memory address, the program would also specify the instruction pointer.
2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer.
3. Because the instruction pointer is returned, the program can go back to the instruction that was "stalled" or "skipped".
4. This would allow the program to issue memory requests without having to wait for them. The program would jump back to the instruction location once its memory has been retrieved. In the meantime it could simply move on with executing other instructions and potentially retrieving other memory locations.

For example:

Load
Load
Load
Load
Load
Load

^ All these instructions specify a memory address, but they also pass the instruction pointer to main memory, and they specify a register to load the memory into.

The question now is: what to do with other instructions that depend on those registers being "live" / "filled" with data? Perhaps these instructions could all check a "flag" inside the register. If the register is "live", the instruction gets executed. If the register is "dead" (/"stalled"), the instruction is skipped. This would allow a programmer to write a lot of semi-parallel code that does or does not get executed depending on whether the memory came in. The program would then simply jump back at appropriate times, when the memory comes in, like an interrupt handler or an exception handler. So this is probably a totally new instruction-flow idea.

Finally, if the memory load takes so long that the program cannot do anything anymore, the programmer could issue a wait, like so:

wait

The processor would then wait either for all loads to complete or for some other interrupt to occur.
So another way to describe this system, in a really short and abstract way, would be: "an event driven memory system". Bye, Skybuck.
#2
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
I think I now understand why CUDA seems to stall on each load (I am not 100% sure, but it looks like stalling). It's probably because of its desired "single instruction, multiple data" behaviour. CUDA wants to wait until all 32 threads have their memory requests fulfilled / their registers loaded. Only then can it continue with its single instruction on "multiple data". However, if all threads are stalled, there is no reason why all threads couldn't simply continue with the next instruction, if it does not depend on the previous load, or if the "alive" idea is used. So again CUDA could continue. I also had a similar idea for SSE: an instruction which did multiple random-access loads with a single instruction. (Never published it, though, but there ya go.)
#3
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"Flying Bucket" posted one more _new_ idea:

"Here is an idea: 1. The computer program issues load and store requests to the memory system. Instead of specifying only the memory address, the program would also specify the instruction pointer. 2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer. 3. Because the instruction pointer is returned, the program can go back to the instruction that was "stalled" or "skipped". 4. This would allow the program to issue memory requests without having to wait for them. The program would jump back to the instruction location once its memory has been retrieved. This would allow the program to simply move on with executing other instructions and potentially retrieving other memory locations."

All these issues are already covered/implemented in x86 architectures; OoO features have existed there for a while (10 years).
__
wolfgang
#4
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
On Jul 22, 9:06 am, "wolfgang kern" wrote:

"All these issues are already covered/implemented in x86 architectures; OoO features have existed there for a while (10 years)."

Indeed. If one is going to add something to the machine language, generally speaking it would be something that allows instructions to execute fully in sequence without delays. So instead of going back to an instruction that requested data from memory, there would be instructions that make a request, and later instructions that use the data.

John Savard
#5
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Quadibloc wrote:

"If one is going to add something to the machine language, generally speaking it would be something that allows instructions to execute fully in sequence without delays. So instead of going back to an instruction that requested data from memory, there would be instructions that make a request, and later instructions that use the data. John Savard"

Guess that is why they have cache!

Jamie
#6
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"wolfgang kern" wrote in message ...

""Flying Bucket" posted one more _new_ idea: Here is an idea: 1. The computer program issues load and store requests to the memory system. Instead of specifying only the memory address, the program would also specify the instruction pointer. 2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer. 3. Because the instruction pointer is returned, the program can go back to the instruction that was "stalled" or "skipped". 4. This would allow the program to issue memory requests without having to wait for them. The program would jump back to the instruction location once its memory has been retrieved. This would allow the program to simply move on with executing other instructions and potentially retrieving other memory locations."

"All these issues are already covered/implemented in x86 architectures; OoO features have existed there for a while (10 years)."

I don't think so... I have never seen x86 processors execute instructions and then later go back to previous instructions. It might do this underneath, but I have never seen a debugger jump back =D

Bye, Skybuck
#7
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Or maybe I just missed the start of it... here is the start of it:
Using Inline Assembly Code (Win32 Only)

The built-in assembler allows you to write assembly code within Delphi programs. The inline assembler is available only on the Win32 Delphi compiler. It has the following features:

Allows for inline assembly.
Supports all instructions found in the Intel Pentium 4, Intel MMX extensions, Streaming SIMD Extensions (SSE), and the AMD Athlon (including 3D Now!).
Provides no macro support, but allows for pure assembly function procedures.
Permits the use of Delphi identifiers, such as constants, types, and variables, in assembly statements.

As an alternative to the built-in assembler, you can link to object files that contain external procedures and functions. See the topic on External declarations for more information. If you have external assembly code that you want to use in your applications, you should consider rewriting it in the Delphi language or minimally reimplementing it using the inline assembler.

Using the asm Statement

The built-in assembler is accessed through asm statements, which have the form

asm statementList end

where statementList is a sequence of assembly statements separated by semicolons, end-of-line characters, or Delphi comments. Comments in an asm statement must be in Delphi style. A semicolon does not indicate that the rest of the line is a comment. The reserved word inline and the directive assembler are maintained for backward compatibility only. They have no effect on the compiler.

"Skybuck Flying" wrote in message news:...

Here is some information for you about Delphi's assembler. I think the calling convention can be "register" or simply "stdcall". This information was copied from a PDF; the copy & paste is a bit messy but might be of some use. If you have Acrobat Reader then you can read the PDF here: http://www.win.tue.nl/~wstomv/edu/de...guageGuide.pdf Scroll down or click in the index: the section is called "inline assembly code (win32 only)".
Understanding Assembler Syntax (Win32 Only)

The inline assembler is available only on the Win32 Delphi compiler. The following material describes the elements of the assembler syntax necessary for proper use: Assembler Statement Syntax, Labels, Instruction Opcodes, Assembly Directives, Operands.

Assembler Statement Syntax

The syntax of an assembly statement is

Label: Prefix Opcode Operand1, Operand2

where Label is a label, Prefix is an assembly prefix opcode (operation code), Opcode is an assembly instruction opcode or directive, and Operand is an assembly expression. Label and Prefix are optional. Some opcodes take only one operand, and some take none. Comments are allowed between assembly statements, but not within them. For example,

MOV AX,1 {Initial value} { OK }
MOV CX,100 {Count} { OK }
MOV {Initial value} AX,1; { Error! }
MOV CX, {Count} 100 { Error! }

Labels

Labels are used in built-in assembly statements as they are in the Delphi language: by writing the label and a colon before a statement. There is no limit to a label's length. As in Delphi, labels must be declared in a label declaration part in the block containing the asm statement. The one exception to this rule is local labels. Local labels are labels that start with an at-sign (@). They consist of an at-sign followed by one or more letters, digits, underscores, or at-signs. Use of local labels is restricted to asm statements, and the scope of a local label extends from the asm reserved word to the end of the asm statement that contains it. A local label doesn't have to be declared.

Instruction Opcodes

The built-in assembler supports all of the Intel-documented opcodes for general application use. Note that operating system privileged instructions may not be supported. Specifically, the following families of instructions are supported:

Pentium family
Pentium Pro and Pentium II
Pentium III
Pentium 4

In addition, the built-in assembler supports the following instruction sets: AMD 3DNow!
(from the AMD K6 onwards) and AMD Enhanced 3DNow! (from the AMD Athlon onwards). For a complete description of each instruction, refer to your microprocessor documentation.

RET instruction sizing

The RET instruction opcode always generates a near return.

Automatic jump sizing

Unless otherwise directed, the built-in assembler optimizes jump instructions by automatically selecting the shortest, and therefore most efficient, form of a jump instruction. This automatic jump sizing applies to the unconditional jump instruction (JMP), and to all conditional jump instructions when the target is a label (not a procedure or function). For an unconditional jump instruction (JMP), the built-in assembler generates a short jump (one-byte opcode followed by a one-byte displacement) if the distance to the target label is -128 to 127 bytes. Otherwise it generates a near jump (one-byte opcode followed by a two-byte displacement). For a conditional jump instruction, a short jump (one-byte opcode followed by a one-byte displacement) is generated if the distance to the target label is -128 to 127 bytes. Otherwise, the built-in assembler generates a short jump with the inverse condition, which jumps over a near jump to the target label (five bytes in total). For example, the assembly statement

JC Stop

where Stop isn't within reach of a short jump, is converted to a machine code sequence that corresponds to this:

JNC Skip
JMP Stop
Skip:

Jumps to the entry points of procedures and functions are always near.

Assembly Directives

The built-in assembler supports three assembly define directives: DB (define byte), DW (define word), and DD (define double word). Each generates data corresponding to the comma-separated operands that follow the directive. The DB directive generates a sequence of bytes. Each operand can be a constant expression with a value between -128 and 255, or a character string of any length.
Constant expressions generate one byte of code, and strings generate a sequence of bytes with values corresponding to the ASCII code of each character. The DW directive generates a sequence of words. Each operand can be a constant expression with a value between -32,768 and 65,535, or an address expression. For an address expression, the built-in assembler generates a near pointer, a word that contains the offset part of the address. The DD directive generates a sequence of double words. Each operand can be a constant expression with a value between -2,147,483,648 and 4,294,967,295, or an address expression. For an address expression, the built-in assembler generates a far pointer, a word that contains the offset part of the address, followed by a word that contains the segment part of the address. The DQ directive defines a quad word for Int64 values.

The data generated by the DB, DW, and DD directives is always stored in the code segment, just like the code generated by other built-in assembly statements. To generate uninitialized or initialized data in the data segment, you should use Delphi var or const declarations. Some examples of DB, DW, and DD directives follow.

asm
  DB 0FFH { One byte }
  DB 0,99 { Two bytes }
  DB 'A' { Ord('A') }
  DB 'Hello world...',0DH,0AH { String followed by CR/LF }
  DB 12,'string' { Delphi style string }
  DW 0FFFFH { One word }
  DW 0,9999 { Two words }
  DW 'A' { Same as DB 'A',0 }
  DW 'BA' { Same as DB 'A','B' }
  DW MyVar { Offset of MyVar }
  DW MyProc { Offset of MyProc }
  DD 0FFFFFFFFH { One double-word }
  DD 0,999999999 { Two double-words }
  DD 'A' { Same as DB 'A',0,0,0 }
  DD 'DCBA' { Same as DB 'A','B','C','D' }
  DD MyVar { Pointer to MyVar }
  DD MyProc { Pointer to MyProc }
end;

When an identifier precedes a DB, DW, or DD directive, it causes the declaration of a byte-, word-, or double-word-sized variable at the location of the directive. For example, the assembler allows the following:

ByteVar DB ?
WordVar DW ?
IntVar DD ?
...
MOV AL,ByteVar
MOV BX,WordVar
MOV ECX,IntVar

The built-in assembler doesn't support such variable declarations. The only kind of symbol that can be defined in an inline assembly statement is a label. All variables must be declared using Delphi syntax; the preceding construction can be replaced by

var
  ByteVar: Byte;
  WordVar: Word;
  IntVar: Integer;
...
asm
  MOV AL,ByteVar
  MOV BX,WordVar
  MOV ECX,IntVar
end;

SMALL and LARGE can be used to determine the width of a displacement:

MOV EAX, [LARGE $1234]

This instruction generates a 'normal' move with a 32-bit displacement ($00001234).

MOV EAX, [SMALL $1234]

The second instruction will generate a move with an address size override prefix and a 16-bit displacement ($1234). SMALL can be used to save space. The following example generates an address size override and a 2-byte address (in total three bytes):

MOV EAX, [SMALL 123]

as opposed to

MOV EAX, [123]

which will generate no address size override and a 4-byte address (in total four bytes).

Two additional directives allow assembly code to access dynamic and virtual methods: VMTOFFSET and DMTINDEX.

VMTOFFSET retrieves the offset in bytes of the virtual method pointer table entry of the virtual method argument from the beginning of the virtual method table (VMT). This directive needs a fully specified class name with a method name as a parameter (for example, TExample.VirtualMethod), or an interface name and an interface method name.

DMTINDEX retrieves the dynamic method table index of the passed dynamic method. This directive also needs a fully specified class name with a method name as a parameter, for example, TExample.DynamicMethod. To invoke the dynamic method, call System.@CallDynaInst with the (E)SI register containing the value obtained from DMTINDEX.

Note: Methods with the message directive are implemented as dynamic methods and can also be called using the DMTINDEX technique.
For example:

TMyClass = class
  procedure x; message MYMESSAGE;
end;

The following example uses both DMTINDEX and VMTOFFSET to access dynamic and virtual methods:

program Project2;

type
  TExample = class
    procedure DynamicMethod; dynamic;
    procedure VirtualMethod; virtual;
  end;

procedure TExample.DynamicMethod;
begin
end;

procedure TExample.VirtualMethod;
begin
end;

procedure CallDynamicMethod(e: TExample);
asm
  // Save ESI register
  PUSH ESI
  // Instance pointer needs to be in EAX
  MOV EAX, e
  // DMT entry index needs to be in (E)SI
  MOV ESI, DMTINDEX TExample.DynamicMethod
  // Now call the method
  CALL System.@CallDynaInst
  // Restore ESI register
  POP ESI
end;

procedure CallVirtualMethod(e: TExample);
asm
  // Instance pointer needs to be in EAX
  MOV EAX, e
  // Retrieve VMT table entry
  MOV EDX, [EAX]
  // Now call the method at offset VMTOFFSET
  CALL DWORD PTR [EDX + VMTOFFSET TExample.VirtualMethod]
end;

var
  e: TExample;
begin
  e := TExample.Create;
  try
    CallDynamicMethod(e);
    CallVirtualMethod(e);
  finally
    e.Free;
  end;
end.

Operands

Inline assembler operands are expressions that consist of constants, registers, symbols, and operators. Within operands, the following reserved words have predefined meanings:

Built-in assembler reserved words

AH CL DX ESP mm4 SHL WORD
AL CS EAX FS mm5 SHR xmm0
AND CX EBP GS mm6 SI xmm1
AX DH EBX HIGH mm7 SMALL xmm2
BH DI ECX LARGE MOD SP xmm3
BL DL EDI LOW NOT SS xmm4
BP CL EDX mm0 OFFSET ST xmm5
BX DMTINDEX EIP mm1 OR TBYTE xmm6
BYTE DS ES mm2 PTR TYPE xmm7
CH DWORD ESI mm3 QWORD VMTOFFSET XOR

Reserved words always take precedence over user-defined identifiers. For example,

var
  Ch: Char;
...
asm
  MOV CH, 1
end;

loads 1 into the CH register, not into the Ch variable. To access a user-defined symbol with the same name as a reserved word, you must use the ampersand (&) override operator:

MOV &Ch, 1

It is best to avoid user-defined identifiers with the same names as built-in assembler reserved words.
Assembly Expressions (Win32 Only)

The built-in assembler evaluates all expressions as 32-bit integer values. It doesn't support floating-point and string values, except string constants. The inline assembler is available only on the Win32 Delphi compiler.

Expressions are built from expression elements and operators, and each expression has an associated expression class and expression type. This topic covers the following material: Differences between Delphi and Assembler Expressions, Expression Elements, Expression Classes, Expression Types, Expression Operators.

Differences between Delphi and Assembler Expressions

The most important difference between Delphi expressions and built-in assembler expressions is that assembler expressions must resolve to a constant value. In other words, an expression must resolve to a value that can be computed at compile time. For example, given the declarations

const
  X = 10;
  Y = 20;
var
  Z: Integer;

the following is a valid statement.

asm
  MOV Z,X+Y
end;

Because both X and Y are constants, the expression X + Y is a convenient way of writing the constant 30, and the resulting instruction simply moves the value 30 into the variable Z. But if X and Y are variables

var
  X, Y: Integer;

the built-in assembler cannot compute the value of X + Y at compile time. In this case, to move the sum of X and Y into Z you would use

asm
  MOV EAX,X
  ADD EAX,Y
  MOV Z,EAX
end;

In a Delphi expression, a variable reference denotes the contents of the variable. But in an assembler expression, a variable reference denotes the address of the variable. In Delphi the expression X + 4 (where X is a variable) means the contents of X plus 4, while to the built-in assembler it means the contents of the word at the address four bytes higher than the address of X. So, even though you are allowed to write

asm
  MOV EAX,X+4
end;

this code doesn't load the value of X plus 4 into EAX; instead, it loads the value of a word stored four bytes beyond X.
The correct way to add 4 to the contents of X is

asm
  MOV EAX,X
  ADD EAX,4
end;

Expression Elements

The elements of an expression are constants, registers, and symbols.

Numeric Constants

Numeric constants must be integers, and their values must be between -2,147,483,648 and 4,294,967,295. By default, numeric constants use decimal notation, but the built-in assembler also supports binary, octal, and hexadecimal. Binary notation is selected by writing a B after the number, octal notation by writing an O after the number, and hexadecimal notation by writing an H after the number or a $ before the number. Numeric constants must start with one of the digits 0 through 9 or the $ character. When you write a hexadecimal constant using the H suffix, an extra zero is required in front of the number if the first significant digit is one of the digits A through F. For example, 0BAD4H and $BAD4 are hexadecimal constants, but BAD4H is an identifier because it starts with a letter.

String Constants

String constants must be enclosed in single or double quotation marks. Two consecutive quotation marks of the same type as the enclosing quotation marks count as only one character. Here are some examples of string constants:

'Z'
'Delphi'
'Linux'
"That's all folks"
'"That''s all folks," he said.'
'100'
'"'
"'"

String constants of any length are allowed in DB directives, and cause allocation of a sequence of bytes containing the ASCII values of the characters in the string. In all other cases, a string constant can be no longer than four characters and denotes a numeric value which can participate in an expression. The numeric value of a string constant is calculated as

Ord(Ch1) + Ord(Ch2) shl 8 + Ord(Ch3) shl 16 + Ord(Ch4) shl 24

where Ch1 is the rightmost (last) character and Ch4 is the leftmost (first) character. If the string is shorter than four characters, the leftmost characters are assumed to be zero. The following table shows string constants and their numeric values.
String examples and their values

String     Value
'a'        00000061H
'ba'       00006261H
'cba'      00636261H
'dcba'     64636261H
'a '       00006120H
'   a'     20202061H
'a' * 2    000000E2H
'a'-'A'    00000020H
not 'a'    FFFFFF9EH

Registers

The following reserved symbols denote CPU registers in the inline assembler:

32-bit general purpose: EAX EBX ECX EDX
32-bit pointer or index: ESP EBP ESI EDI
16-bit general purpose: AX BX CX DX
16-bit pointer or index: SP BP SI DI
8-bit low registers: AL BL CL DL
8-bit high registers: AH BH CH DH
16-bit segment registers: CS DS SS ES
32-bit segment registers: FS GS
Coprocessor register stack: ST

When an operand consists solely of a register name, it is called a register operand. All registers can be used as register operands, and some registers can be used in other contexts. The base registers (BX and BP) and the index registers (SI and DI) can be written within square brackets to indicate indexing. Valid base/index register combinations are [BX], [BP], [SI], [DI], [BX+SI], [BX+DI], [BP+SI], and [BP+DI]. You can also index with all the 32-bit registers, for example, [EAX+ECX], [ESP], and [ESP+EAX+5]. The segment registers (ES, CS, SS, DS, FS, and GS) are supported, but segments are normally not useful in 32-bit applications. The symbol ST denotes the topmost register on the 8087 floating-point register stack. Each of the eight floating-point registers can be referred to using ST(X), where X is a constant between 0 and 7 indicating the distance from the top of the register stack.

Symbols

The built-in assembler allows you to access almost all Delphi identifiers in assembly language expressions, including constants, types, variables, procedures, and functions. In addition, the built-in assembler implements the special symbol @Result, which corresponds to the Result variable within the body of a function.
For example, the function

function Sum(X, Y: Integer): Integer;
begin
  Result := X + Y;
end;

could be written in assembly language as

function Sum(X, Y: Integer): Integer; stdcall;
begin
  asm
    MOV EAX,X
    ADD EAX,Y
    MOV @Result,EAX
  end;
end;

The following symbols cannot be used in asm statements:

Standard procedures and functions (for example, WriteLn and Chr).
String, floating-point, and set constants (except when loading registers).
Labels that aren't declared in the current block.
The @Result symbol outside of functions.

The following table summarizes the kinds of symbol that can be used in asm statements.

Symbols recognized by the built-in assembler

Symbol     Value                                                  Class             Type
Label      Address of label                                       Memory reference  Size of type
Constant   Value of constant                                      Immediate value   0
Type       0                                                      Memory reference  Size of type
Field      Offset of field                                        Memory            Size of type
Variable   Address of variable or address of a pointer to it      Memory reference  Size of type
Procedure  Address of procedure                                   Memory reference  Size of type
Function   Address of function                                    Memory reference  Size of type
Unit       0                                                      Immediate value   0
@Result    Result variable offset                                 Memory reference  Size of type

With optimizations disabled, local variables (variables declared in procedures and functions) are always allocated on the stack and accessed relative to EBP, and the value of a local variable symbol is its signed offset from EBP. The assembler automatically adds [EBP] in references to local variables. For example, given the declaration

var
  Count: Integer;

within a function or procedure, the instruction MOV EAX,Count assembles into MOV EAX,[EBP-4].

The built-in assembler treats var parameters as 32-bit pointers, and the size of a var parameter is always 4. The syntax for accessing a var parameter is different from that for accessing a value parameter. To access the contents of a var parameter, you must first load the 32-bit pointer and then access the location it points to.
For example,

function Sum(var X, Y: Integer): Integer; stdcall;
begin
  asm
    MOV EAX,X
    MOV EAX,[EAX]
    MOV EDX,Y
    ADD EAX,[EDX]
    MOV @Result,EAX
  end;
end;

Identifiers can be qualified within asm statements. For example, given the declarations

type
  TPoint = record
    X, Y: Integer;
  end;
  TRect = record
    A, B: TPoint;
  end;
var
  P: TPoint;
  R: TRect;

the following constructions can be used in an asm statement to access fields.

MOV EAX,P.X
MOV EDX,P.Y
MOV ECX,R.A.X
MOV EBX,R.B.Y

A type identifier can be used to construct variables on the fly. Each of the following instructions generates the same machine code, which loads the contents of [EDX] into EAX.

MOV EAX,(TRect PTR [EDX]).B.X
MOV EAX,TRect([EDX]).B.X
MOV EAX,TRect[EDX].B.X
MOV EAX,[EDX].TRect.B.X

Expression Classes

The built-in assembler divides expressions into three classes: registers, memory references, and immediate values. An expression that consists solely of a register name is a register expression. Examples of register expressions are AX, CL, DI, and ES. Used as operands, register expressions direct the assembler to generate instructions that operate on the CPU registers.

Expressions that denote memory locations are memory references. Delphi's labels, variables, typed constants, procedures, and functions belong to this category.

Expressions that aren't registers and aren't associated with memory locations are immediate values. This group includes Delphi's untyped constants and type identifiers.

Immediate values and memory references cause different code to be generated when used as operands. For example,

const
  Start = 10;
var
  Count: Integer;
...
asm
  MOV EAX,Start        { MOV EAX,xxxx }
  MOV EBX,Count        { MOV EBX,[xxxx] }
  MOV ECX,[Start]      { MOV ECX,[xxxx] }
  MOV EDX,OFFSET Count { MOV EDX,xxxx }
end;

Because Start is an immediate value, the first MOV is assembled into a move immediate instruction. The second MOV, however, is translated into a move memory instruction, as Count is a memory reference.
In the third MOV, the brackets convert Start into a memory reference (in this case, the word at offset 10 in the data segment). In the fourth MOV, the OFFSET operator converts Count into an immediate value (the offset of Count in the data segment). The brackets and OFFSET operator complement each other. The following asm statement produces identical machine code to the first two lines of the previous asm statement.

asm
  MOV EAX,OFFSET [Start]
  MOV EBX,[OFFSET Count]
end;

Memory references and immediate values are further classified as either relocatable or absolute. Relocation is the process by which the linker assigns absolute addresses to symbols. A relocatable expression denotes a value that requires relocation at link time, while an absolute expression denotes a value that requires no such relocation. Typically, expressions that refer to labels, variables, procedures, or functions are relocatable, since the final address of these symbols is unknown at compile time. Expressions that operate solely on constants are absolute. The built-in assembler allows you to carry out any operation on an absolute value, but it restricts operations on relocatable values to addition and subtraction of constants.

Expression Types

Every built-in assembler expression has a type or, more correctly, a size, because the assembler regards the type of an expression simply as the size of its memory location. For example, the type of an Integer variable is four, because it occupies 4 bytes. The built-in assembler performs type checking whenever possible, so in the instructions

var
  QuitFlag: Boolean;
  OutBufPtr: Word;
...
asm
  MOV AL,QuitFlag
  MOV BX,OutBufPtr
end;

the assembler checks that the size of QuitFlag is one (a byte), and that the size of OutBufPtr is two (a word). The instruction MOV DL,OutBufPtr produces an error because DL is a byte-sized register and OutBufPtr is a word.
The type of a memory reference can be changed through a typecast; these are correct ways of writing the previous instruction:

MOV DL,BYTE PTR OutBufPtr
MOV DL,Byte(OutBufPtr)
MOV DL,OutBufPtr.Byte

These MOV instructions all refer to the first (least significant) byte of the OutBufPtr variable. In some cases, a memory reference is untyped. One example is an immediate value (Buffer) enclosed in square brackets:

procedure Example(var Buffer);
asm
  MOV AL, [Buffer]
  MOV CX, [Buffer]
  MOV EDX, [Buffer]
end;

The built-in assembler permits these instructions, because the expression [Buffer] has no type; it just means "the contents of the location indicated by Buffer," and the type can be determined from the first operand (byte for AL, word for CX, and double-word for EDX). In cases where the type can't be determined from another operand, the built-in assembler requires an explicit typecast. For example,

INC BYTE PTR [ECX]
IMUL WORD PTR [EDX]

The following table summarizes the predefined type symbols that the built-in assembler provides in addition to any currently declared Delphi types.

Predefined type symbols

Symbol  Type
BYTE    1
WORD    2
DWORD   4
QWORD   8
TBYTE   10

Expression Operators

The built-in assembler provides a variety of operators. Precedence rules are different from those of the Delphi language; for example, in an asm statement, AND has lower precedence than the addition and subtraction operators. The following table lists the built-in assembler's expression operators in decreasing order of precedence.

Precedence of built-in assembler expression operators

Operators                     Remarks           Precedence
&                                               highest
(...), [...], ., HIGH, LOW
+, -                          unary + and -
:
OFFSET, TYPE, PTR
*, /, MOD, SHL, SHR
+, -                          binary + and -
NOT, AND, OR, XOR                               lowest

The following table defines the built-in assembler's expression operators.

Definitions of built-in assembler expression operators

& — Identifier override.
The identifier immediately following the ampersand is treated as a user-defined symbol, even if the spelling is the same as a built-in assembler reserved symbol.

(...)   Subexpression. Expressions within parentheses are evaluated completely prior to being treated as a single expression element. Another expression can precede the expression within the parentheses; the result in this case is the sum of the values of the two expressions, with the type of the first expression.

[...]   Memory reference. The expression within brackets is evaluated completely prior to being treated as a single expression element. Another expression can precede the expression within the brackets; the result in this case is the sum of the values of the two expressions, with the type of the first expression. The result is always a memory reference.

.       Structure member selector. The result is the sum of the expression before the period and the expression after the period, with the type of the expression after the period. Symbols belonging to the scope identified by the expression before the period can be accessed in the expression after the period.

HIGH    Returns the high-order 8 bits of the word-sized expression following the operator. The expression must be an absolute immediate value.

LOW     Returns the low-order 8 bits of the word-sized expression following the operator. The expression must be an absolute immediate value.

+       Unary plus. Returns the expression following the plus with no changes. The expression must be an absolute immediate value.

-       Unary minus. Returns the negated value of the expression following the minus. The expression must be an absolute immediate value.

+       Addition. The expressions can be immediate values or memory references, but only one of the expressions can be a relocatable value. If one of the expressions is a relocatable value, the result is also a relocatable value. If either of the expressions is a memory reference, the result is also a memory reference.
-       Subtraction. The first expression can have any class, but the second expression must be an absolute immediate value. The result has the same class as the first expression.

:       Segment override. Instructs the assembler that the expression after the colon belongs to the segment given by the segment register name (CS, DS, SS, FS, GS, or ES) before the colon. The result is a memory reference with the value of the expression after the colon. When a segment override is used in an instruction operand, the instruction is prefixed with an appropriate segment-override prefix instruction to ensure that the indicated segment is selected.

OFFSET  Returns the offset part (double word) of the expression following the operator. The result is an immediate value.

TYPE    Returns the type (size in bytes) of the expression following the operator. The type of an immediate value is 0.

PTR     Typecast operator. The result is a memory reference with the value of the expression following the operator and the type of the expression in front of the operator.

*       Multiplication. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

/       Integer division. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

MOD     Remainder after integer division. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

SHL     Logical shift left. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

SHR     Logical shift right. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

NOT     Bitwise negation. The expression must be an absolute immediate value, and the result is an absolute immediate value.

AND     Bitwise AND. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

OR      Bitwise OR.
Both expressions must be absolute immediate values, and the result is an absolute immediate value.

XOR     Bitwise exclusive OR. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

Assembly Procedures and Functions (Win32 Only)

You can write complete procedures and functions using inline assembly language code, without including a begin...end statement. This topic covers these issues: Compiler Optimizations and Function Results. The inline assembler is available only on the Win32 Delphi compiler.

Compiler Optimizations

An example of the type of function you can write is as follows:

function LongMul(X, Y: Integer): Longint;
asm
  MOV EAX,X
  IMUL Y
end;

The compiler performs several optimizations on these routines:

No code is generated to copy value parameters into local variables. This affects all string-type value parameters and other value parameters whose size isn't 1, 2, or 4 bytes. Within the routine, such parameters must be treated as if they were var parameters.

Unless a function returns a string, variant, or interface reference, the compiler doesn't allocate a function result variable; a reference to the @Result symbol is an error. For strings, variants, and interfaces, the caller always allocates an @Result pointer.

The compiler only generates stack frames for nested routines, for routines that have local parameters, or for routines that have parameters on the stack.

Locals is the size of the local variables and Params is the size of the parameters. If both Locals and Params are zero, there is no entry code, and the exit code consists simply of a RET instruction. The automatically generated entry and exit code for the routine looks like this:

PUSH EBP        ;Present if Locals <> 0 or Params <> 0
MOV EBP,ESP     ;Present if Locals <> 0 or Params <> 0
SUB ESP,Locals  ;Present if Locals <> 0
....
MOV ESP,EBP     ;Present if Locals <> 0
POP EBP         ;Present if Locals <> 0 or Params <> 0
RET Params      ;Always present

If locals include variants, long strings, or interfaces, they are initialized to zero but not finalized.

Function Results

Assembly language functions return their results as follows. Ordinal values are returned in AL (8-bit values), AX (16-bit values), or EAX (32-bit values). Real values are returned in ST(0) on the coprocessor's register stack. (Currency values are scaled by 10000.) Pointers, including long strings, are returned in EAX. Short strings and variants are returned in the temporary location pointed to by @Result. "

Bye,
  Skybuck.
#8
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"Skybuck Flying" wrote in message news:... Also in case you have troubles understanding the pascal code then I shall try and write down some simply and easy to understand pseudo code and some explanations/comments, let's see: What the code does and assumes is the following: general description, this information not that important but might give you the general idea: "Memory" is a pointer towards a big linear/sequentially memory block of bytes/integers. The memory block contains many integers. These integers are "virtually" subdivided into seperate blocks. Each block contains 8000 integers. This is indicated by ElementCount, so ElementCount equals 8000. There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals 4000. So in total there are 4000 blocks * 8000 elements * 4 bytes = xxxx bytes of memory. Each block is like a chain of indexes. Each index point towards the next index within it's block. The indexes are thrown around inside the block during initialisation to create a random access pattern. This is also known as the "pointer chasing problem" so it could also be called the "index chasing problem". Index - Index - Index - Index - Index and so forth. The first index must be retrieved from memory, only then does one know where the next index is located and so on. There is an ElementIndex variable which is initialized to zero. So each ElementIndex simply starts at element 0 of block 0. There could be 8000 ElementIndexes in parallel all starting at element 0 of block X. Or there could simply be on ElementIndex and process each block in turn. in version 0.03 "BlockBase" was introduced. BlockBase as an index which points towards the first element of a block. So it's a storage to prevent multiplications all the time. In the code below 3 blocks in parallel are attempted. Now for some further pseudo code: // routine // variables: // each block is processed repeatedly loop count times. 
// loop index indicates the current loop/round
LoopIndex

BlockIndexA    // indicates block a number
BlockIndexB    // indicates block b number
BlockIndexC    // indicates block c number

ElementIndexA  // indicates block a element index
ElementIndexB  // indicates block b element index
ElementIndexC  // indicates block c element index

ElementCount   // number of elements per block
BlockCount     // number of blocks
LoopCount      // number of loops/chases per block

BlockBaseA     // starting index of block a
BlockBaseB     // starting index of block b
BlockBaseC     // starting index of block c

Memory         // contains all integers of all blocks of all elements.
               // So it's an array/block of elements/integers.

RoutineBegin

  ElementCount = 8000
  BlockCount = 4000
  LoopCount = 80000

  BlockIndexA = 0
  BlockIndexB = 1
  BlockIndexC = 2

  // loop over all blocks, process 3 blocks per loop iteration.
  // slight correction:
  // wrong: "BlockIndex goes from 0 to 7999 divided over 3 indexes"
  // somewhat correct: "BlockIndex goes from 0 to 3999 divided over 3 indexes"
  // so BlockIndexA is 0, 3, 6, 9, 12, etc
  // so BlockIndexB is 1, 4, 7, 10, 13, etc
  // so BlockIndexC is 2, 5, 8, 11, 14, etc

  FirstLoopBegin

    // calculate the starting index of each block.
    // formula is: Base/Index/Offset = block number * number of elements per block.
    BlockBaseA = BlockIndexA * vElementCount;
    BlockBaseB = BlockIndexB * vElementCount;
    BlockBaseC = BlockIndexC * vElementCount;

    // initialise each element index to the first index of the block.
    ElementIndexA = 0
    ElementIndexB = 0
    ElementIndexC = 0

    // loop 80000 times through the block arrays/elements
    SecondLoopBegin "LoopIndex goes from 0 to 79999"

      // Seek into memory at location BlockBase + ElementIndex.
      // Do this for A, B and C:
      // retrieve the new element index via the old element index
      ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
      ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
      ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]

    SecondLoopEnd

    // store some bull**** in block result; this step could be skipped,
    // but let's not, so the compiler doesn't optimize anything away.
    // perhaps block result should be printed just in case.
    // the last retrieved index is stored into it.
    // do this for A, B, C
    BlockResult[ BlockIndexA ] = vElementIndexA
    BlockResult[ BlockIndexB ] = vElementIndexB
    BlockResult[ BlockIndexC ] = vElementIndexC

    // once the 3 blocks have been processed go to the next 3 blocks
    BlockIndexA = vBlockIndexA + 3
    BlockIndexB = vBlockIndexB + 3
    BlockIndexC = vBlockIndexC + 3

  FirstLoopEnd

Perhaps this clarifies it a little bit. Let me know if you have any further questions.

Bye,
  Skybuck.
#9
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Skybuck Flying wrote:
"Skybuck Flying" wrote in message news:... Also in case you have troubles understanding the pascal code then I shall try and write down some simply and easy to understand pseudo code and some explanations/comments, let's see: What the code does and assumes is the following: general description, this information not that important but might give you the general idea: "Memory" is a pointer towards a big linear/sequentially memory block of bytes/integers. The memory block contains many integers. These integers are "virtually" subdivided into seperate blocks. Each block contains 8000 integers. This is indicated by ElementCount, so ElementCount equals 8000. There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals 4000. So in total there are 4000 blocks * 8000 elements * 4 bytes = xxxx bytes of memory. Each block is like a chain of indexes. Each index point towards the next index within it's block. The indexes are thrown around inside the block during initialisation to create a random access pattern. This is also known as the "pointer chasing problem" so it could also be called the "index chasing problem". Index - Index - Index - Index - Index and so forth. The first index must be retrieved from memory, only then does one know where the next index is located and so on. There is an ElementIndex variable which is initialized to zero. So each ElementIndex simply starts at element 0 of block 0. There could be 8000 ElementIndexes in parallel all starting at element 0 of block X. Or there could simply be on ElementIndex and process each block in turn. in version 0.03 "BlockBase" was introduced. BlockBase as an index which points towards the first element of a block. So it's a storage to prevent multiplications all the time. 4000 is 0x0FA0, 8000 is 0x1F40. If we expand these sizes to 4096 and 8192, we get ranges 0x0000...0x0FFF(0x1FFF). This allows simple bit masks to sort out invalid indices and simplifies address calculations markably. 
If we see the entire thing as an array with 4096 pages of 8192 elements, we just have to create a large array with a size of 4096 * 8192 * 4 = 134,217,728 bytes.

In the code below 3 blocks in parallel are attempted.

A very bad idea. L1 cache is 65,536 bytes on recent CPUs, but only 32,768 bytes on future processors, e.g. Zambezi. Three blocks occupy 3 * 32,768 = 98,304 bytes, so 32,768 (or 65,536!) bytes do not fit into L1. The result: memory access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or 27 (RAM) clocks per access, depending on where the cache lines have to be fetched from or written to.

If you want to process multiple blocks in parallel, it's better to check how many physical processors exist on a user's machine, then create that many threads (minus 1 for the OS), each running on one physical processor.

Now for some further pseudo code:

// routine
// variables:
// each block is processed repeatedly, loop count times.
// loop index indicates the current loop/round
LoopIndex
BlockIndexA    // indicates block a number
BlockIndexB    // indicates block b number
BlockIndexC    // indicates block c number
ElementIndexA  // indicates block a element index
ElementIndexB  // indicates block b element index
ElementIndexC  // indicates block c element index
ElementCount   // number of elements per block
BlockCount     // number of blocks
LoopCount      // number of loops/chases per block
BlockBaseA     // starting index of block a
BlockBaseB     // starting index of block b
BlockBaseC     // starting index of block c
Memory         // contains all integers of all blocks of all elements.

RoutineBegin

ElementCount = 8000
BlockCount = 4000
LoopCount = 80000

BlockIndexA = 0
BlockIndexB = 1
BlockIndexC = 2

// loop over all blocks, process 3 blocks per loop iteration.
// slight correction:
// wrong: "BlockIndex goes from 0 to 7999 divided over 3 indexes"
// somewhat correct: "BlockIndex goes from 0 to 3999 divided over 3 indexes"
// so BlockIndexA is 0, 3, 6, 9, 12, etc
// so BlockIndexB is 1, 4, 7, 10, 13, etc
// so BlockIndexC is 2, 5, 8, 11, 14, etc

FirstLoopBegin

// calculate the starting index of each block.
// formula is: Base/Index/Offset = block number * number of elements per block.
BlockBaseA = BlockIndexA * vElementCount;
BlockBaseB = BlockIndexB * vElementCount;
BlockBaseC = BlockIndexC * vElementCount;

With 8,192 elements per block, the multiplication can be replaced by a shift:

BlockBaseN = BlockIndexN << 15

reducing the VectorPath MUL (blocking all pipes for five clocks) to a DirectPath SHL (one clock, one pipe).

// initialise each element index to the first index of the block.
ElementIndexA = 0
ElementIndexB = 0
ElementIndexC = 0

// loop 80000 times through the block arrays/elements
SecondLoopBegin "LoopIndex goes from 0 to 79999"

// Seek into memory at location BlockBase + ElementIndex
// do this for A, B and C:
// retrieve the new element index via the old element index
ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]

SecondLoopEnd

I still do not understand what this is good for, but you should consider processing one block after the other. If you are eager to introduce something like "parallel processing", divide your entire array into as many parts as there are processors (hopefully a multiple of two), then let those threads process a contiguous part of memory in parallel.

If you prefer a single threaded solution, it's better to process one block after the other, because entire blocks fit into L1 completely.
Any "touched" element is present in L1 after the first access, three registers are enough to manage the entire loop, faster algorithms can be used to process multiple elements in parallel (all pipes busy most of the time), and so on, // store some bull**** in block result, this step could be skipped. // but let's not so compiler doesn't optimize anything away // perhaps block result should be printed just in case. // the last retrieved index is store into it. // do this for A,B,C BlockResult[ BlockIndexA ] = vElementIndexA BlockResult[ BlockIndexB ] = vElementIndexB BlockResult[ BlockIndexC ] = vElementIndexC // once the 3 blocks have been processed go to the next 3 blocks BlockIndexA = vBlockIndexA + 3 BlockIndexB = vBlockIndexB + 3 BlockIndexC = vBlockIndexC + 3 FirstLoopEnd Perhaps this clearifies it a little bit. Let me know if you have any further questions As shown, your current approach is quite slow - it might be worth a thought to change your current design towards something making use of built-in acceleration mechanisms like caching and multiple execution pipelines. Using multiples of two (rather than ten) simplifies cal- culations markably. Hence, the first thing to change was the size of your array - 4,096 blocks and 8,192 elements don't occupy that much additional memory, 6,217,728 byte are peanuts compared to the entire array. In return, you save a lot of time wasted with calculating addresses and validating indices (AND automatically keeps your indices within the valid range). The slightly changed design used addressing modes like mov REG,block_offset[block_base + index * 4] or mov block_offset[block_base + index * 4],REG allowing to step through the entire array with some very simple (and fast) additions and shifts. For example, the given function required one register for block_base plus block_offset. This had to be updated only once per outer loop, while the inner loop just fetches indexed elements and writes them back. 
Reducing address calculations to a single register leaves 12 registers for the real work (two registers are required as loop counters).

Greetings from Augsburg

Bernhard Schornak
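Bernhard's power-of-two layout can be sketched in C. The constant and function names here are invented for illustration; the shift counts follow his numbers (8,192 = 2^13 elements of 4 bytes each, so a block is 2^15 bytes):

```c
/* Power-of-two block layout: the MUL in
   BlockBase = BlockIndex * ElementCount becomes a shift, and an AND
   with (ElementCount - 1) keeps any index inside the valid range
   without a compare-and-branch. */
enum {
    ElementShift = 13,
    ElementCount = 1 << ElementShift,   /* 8192 elements            */
    ElementMask  = ElementCount - 1,    /* 0x1FFF                   */
    ByteShift    = ElementShift + 2     /* 4-byte elements: * 4     */
};

/* Element index of the first element of a block. */
static inline unsigned block_base(unsigned block_index)
{
    return block_index << ElementShift;
}

/* Byte offset of a block, as used in "mov REG,off[base + index * 4]". */
static inline unsigned block_byte_base(unsigned block_index)
{
    return block_index << ByteShift;
}

/* Mask an arbitrary value into the range 0 .. ElementCount - 1. */
static inline unsigned clamp_index(unsigned ix)
{
    return ix & ElementMask;
}
```

The mask is why invalid indices "sort themselves out": any out-of-range value wraps back into the block instead of requiring a range check.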
#10
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"
4000 is 0x0FA0, 8000 is 0x1F40."

Ok, this could be handy to know, to spot these constants/parameters potentially in your code.

"If we expand these sizes to 4096 and 8192, we get ranges 0x0000...0x0FFF (0x1FFF). This allows simple bit masks to sort out invalid indices and simplifies address calculations markedly."

Why would you want to spot invalid indices? At least my Pascal code seems already ok. Perhaps this is to help you debug your assembler code? Address calculations should be possible for any setting; this allows flexibility in testing different scenarios.

"If we see the entire thing as an array with 4096 pages of 8192 elements, we just have to create a large array with a size of 4096 * 8192 * 4 = 134,217,728 bytes."

Irrelevant, any size can be allocated easily, at least in a high level language; perhaps this helps you in assembler?

In the code below 3 blocks in parallel are attempted.

"A very bad idea."

This was your idea after all. It might be bad for the current settings, but maybe there is some value in it for other settings.

"L1 cache is 65,536 bytes on recent CPUs, but only 32,768 bytes on future processors, e.g. Zambezi. Three blocks occupy 3 * 32,768 = 98,304 bytes, so 32,768 (or 65,536!) bytes do not fit into L1. The result: memory access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or 27 (RAM) clocks per access, depending on where the cache lines have to be fetched from or written to. If you want to process multiple blocks in parallel, it's better to check how many physical processors exist on a user's machine, then create that many threads (minus 1 for the OS), each running on one physical processor."

So far the examples have all executed well inside the cache, because of the low settings which almost fit in L1 cache. However it would also be interesting to change the settings to, for example, a block size of 1920x1200 elements to mimic a single frame of a video codec, which might want to do some random access memory.
So then the L1 cache will probably be woefully short, and then you can still try and see if your "pairing/parallel" processing of frames might give higher throughput.

However I already do see a little problem with this idea. Usually video codec frames are processed sequentially, so it might be difficult to impossible to process multiple at the same time. However a frame does consist of r, g, b and sometimes maybe even an alpha channel. So there could be 3 to 4 channels, so your idea of processing 3 blocks at the same time might still have some merit, when each color channel is considered a block.

Another interesting setting could be to reduce the settings to make it fit entirely into the L1 cache, to see if your pairing/multiple loading code also becomes faster when it does fit in cache.

Writing multi threading code for "random access memory" could become tricky, especially if writes are involved as well. Processors might conflict with each other. As long as processors operate within their own blocks and can never access each other's blocks, then it should be ok. Some care should still be taken at the border of the blocks; inserting some padding there could help prevent accidental/unlucky simultaneous access.

So it does become a little bit more tricky to write code like that... because of the padding, allocations have to be done a little bit differently, as well as the formulas. I can see it becoming a bit messy, though some variable names could help solve the mess, as follows:

Allocate( mElementCount + mPaddingCount );

^ Then the padding count could be used in formulas/code where necessary, or left out where not necessary. Like mod mElementCount to set up indexes correctly.

"With 8,192 elements per block, the multiplication can be replaced by a shift:

BlockBaseN = BlockIndexN << 15

reducing the VectorPath MUL (blocking all pipes for five clocks) to a DirectPath SHL (one clock, one pipe).
" While to explanation of this optimization idea/trick is interesting unfortunately it cannot be really used in the example code, The example code needs to remain flexible to try out different scenerio's to see if your idea can be usefull for something // initialise each element index to the first index of the block. ElementIndexA = 0 ElementIndexB = 0 ElementIndexC = 0 // loop 80000 times through the block arrays/elements SecondLoopBegin "LoopIndex goes from 0 to 79999" // Seek into memory at location BlockBase + ElementIndex // do this for A, B and C: // retrieve the new element index via the old element index ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ] ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ] ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ] SecondLoopEnd " I still do not understand what this is good for, but you should consider to process one block after the other. If you are eager to introduce something like "parallel pro- cessing", divide your entire array into as much parts as there are processors (hopefully a multiple of two), then let those threads process a contiguous part of memory in parallel. " The code above represents your "parallel block" idea, where there would be 3 loads in parallel. The mMemory[ ... ] is a load. See one of your earlier postings of yourself where you write about that idea. You "claimed" the X2 processor can do 3 loads in parallel more or less. The code above tries to take adventage of that. " If you prefer a single threaded solution, it's better to process one block after the other, because entire blocks fit into L1 completely. " Perhaps your parallel loading idea might also be faster in such a situation. If not then the opposite situation of not fitting into L1 could be tested as well to see if your parallel loading idea gives more performance. 
" Any "touched" element is present in L1 after the first access, three registers are enough to manage the entire loop, faster algorithms can be used to process multiple elements in parallel (all pipes busy most of the time), and so on, " ? // store some bull**** in block result, this step could be skipped. // but let's not so compiler doesn't optimize anything away // perhaps block result should be printed just in case. // the last retrieved index is store into it. // do this for A,B,C BlockResult[ BlockIndexA ] = vElementIndexA BlockResult[ BlockIndexB ] = vElementIndexB BlockResult[ BlockIndexC ] = vElementIndexC // once the 3 blocks have been processed go to the next 3 blocks BlockIndexA = vBlockIndexA + 3 BlockIndexB = vBlockIndexB + 3 BlockIndexC = vBlockIndexC + 3 FirstLoopEnd Perhaps this clearifies it a little bit. Let me know if you have any further questions " As shown, your current approach is quite slow " Which one ? The single block version (which should be quite fast) or the three block version (which was your idea ! ). " - it might be worth a thought to change your current design towards something making use of built-in acceleration mechanisms like caching and multiple execution pipelines. " It already does caching by accident more or less, multiple execution pipelines that was tried with your idea. I think it does do both a bit... I don't know exactly how much it already does these things. A processor simulator would be necessary to see what exactly is happening. But I can tell a little bit from re-arranging the code that it does some of these things that you mentioned. So perhaps the "three block" version is already optimal but it's still slower than the "one block" version, at least for 32 bit because then it fits in L1 cache, however different settings should still be tried to see which one if faster then. Also 64 bit assembly code of yours should still be tried, it's not tried yet. 
" Using multiples of two (rather than ten) simplifies cal- culations markably. Hence, the first thing to change was the size of your array - 4,096 blocks and 8,192 elements don't occupy that much additional memory, 6,217,728 byte are peanuts compared to the entire array. In return, you save a lot of time wasted with calculating addresses and validating indices (AND automatically keeps your indices within the valid range). " As long as loading and writing is not affected by the extra padding bytes, and as long as the example executes correctly then arrays could be padding with extra bytes if it makes address calculations faster without changing the nature of the example. However if the example suddenly produces different results then it would be a problem. " The slightly changed design used addressing modes like mov REG,block_offset[block_base + index * 4] or mov block_offset[block_base + index * 4],REG " What's the difference here ? One seems reading the other seems writing ? So apperently you comparing this code to something else ? But what ? The shifting code ? " allowing to step through the entire array with some very simple (and fast) additions and shifts. For example, the given function required one register for block_base plus block_offset. This had to be updated only once per outer loop, while the inner loop just fetches indexed elements and writes them back. Reducing address calculations to a single register, 12 registers are left for the real work (two registers are required as loop counters). " Ok, I think I now grasp your concept of using the shifting tricks and why it would lead to higher throughput. Now that I look at your shifting code again it seems to be applied to the block number. So at least the blocks could be padded for the elements. So the number of actualy elements would be a bit higher, some elements would be unused. The example should remain with 4000 blocks and 8000 elements "virtually" but in reality there could be more... 
however the "more" should not be processed and only be there to speed up the addressing. However it remains to be seen what the effect of this padding could be, for single block processing it doesn't seem to be relevant and caching would be uneffected, the padded elements would not be cached except for the very end if hit. For the multiple blocks idea this could waste a little bit of caching space but might not be to bad. So there is some merit in your idea It could be interesting to write some general code to try and always apply this padding trick and the shifting trick Changes are high I will try to write this soon and see what happens, I expect it run a bit faster after reading your comments and explanation of it Bye, Skybuck =D |