#1
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Here is an idea:
1. The computer program issues load and store requests to the memory system. Instead of specifying only the memory address, the program would also specify the instruction pointer.
2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer.
3. Because the instruction pointer is returned, the program can go back to the instruction that was "stalled" or "skipped".
4. This would allow the program to issue memory requests without having to wait for them. The program would jump back to the instruction location once its memory has been retrieved. In the meantime it could simply move on with executing other instructions and potentially retrieving other memory locations.

For example:

Load
Load
Load
Load
Load
Load

^ All these instructions specify a memory address, but they also pass the instruction pointer to main memory, and they specify a register to load the memory into.

The question now is: what to do with other instructions that depend on those registers being "live" / "filled" with data? Perhaps these instructions could all check a "flag" inside the register. If the register is "live", the instruction gets executed. If the register is "dead" (/"stalled"), the instruction is skipped. This would allow a programmer to write a lot of semi-parallel code that does or does not get executed depending on whether the memory came in. The program would then simply jump back at appropriate times, when the memory comes in, like an interrupt handler or an exception handler. So this is probably a totally new instruction-flow idea.

Finally, if the memory load takes so long that the program cannot do anything anymore, the programmer could issue a wait, like so:

wait

The processor would then wait either for all loads to complete or for some other interrupt to occur.
So another way to describe this system, in a really short and abstract way, would be: "an event driven memory system". Bye, Skybuck.
#2
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
I think I now understand why CUDA seems to stall on each load (I am not 100% sure, but it looks like stalling). It's probably because of its desired "single instruction, multiple data" behaviour. CUDA wants to wait until all 32 threads have their memory requests fulfilled / their registers loaded. Only then can it continue with its single instruction on "multiple data". However, if all threads are stalled, there is no reason why all threads couldn't simply continue with the next instruction, if it does not depend on the previous load, or if the "alive" idea is used. So again CUDA could continue. I also had a similar idea for SSE: an instruction which did multiple random-access loads with a single instruction. (Never published it, though, but there ya go.)
#3
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"Flying Bucket" posted one more _new_ idea:

"Here is an idea: 1. The computer program issues load and store requests to the memory system. Instead of specifying only the memory address, the program would also specify the instruction pointer. 2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer. 3. Because the instruction pointer is returned, the program can go back to the instruction that was "stalled" or "skipped". 4. This would allow the program to issue memory requests without having to wait for them. The program would jump back to the instruction location once its memory has been retrieved. This would allow the program to simply move on with executing other instructions and potentially retrieving other memory locations."

All these issues are already covered/implemented in x86 architectures; OoO features have existed there for a while (10 years).
__
wolfgang
#4
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
On Jul 22, 9:06 am, "wolfgang kern" wrote:

"All these issues are already covered/implemented in x86 architectures; OoO features have existed there for a while (10 years)."

Indeed. If one is going to add something to the machine language, generally speaking it would be something that allows instructions to execute fully in sequence without delays. So instead of going back to an instruction that requested data from memory, there would be instructions that make a request, and later instructions that use the data.

John Savard
#5
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Quadibloc wrote:

"If one is going to add something to the machine language, generally speaking it would be something that allows instructions to execute fully in sequence without delays. So instead of going back to an instruction that requested data from memory, there would be instructions that make a request, and later instructions that use the data. John Savard"

Guess that is why they have cache!

Jamie
#6
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"wolfgang kern" wrote in message ...

""Flying Bucket" posted one more _new_ idea: Here is an idea: 1. The computer program issues load and store requests to the memory system. Instead of specifying only the memory address, the program would also specify the instruction pointer. 2. The memory system fetches the memory located at the memory address and returns it, but it also returns the instruction pointer. 3. Because the instruction pointer is returned, the program can go back to the instruction that was "stalled" or "skipped". 4. This would allow the program to issue memory requests without having to wait for them. The program would jump back to the instruction location once its memory has been retrieved. This would allow the program to simply move on with executing other instructions and potentially retrieving other memory locations."

"All these issues are already covered/implemented in x86 architectures; OoO features have existed there for a while (10 years)."

I don't think so... I have never seen x86 processors execute instructions and then later go back to previous instructions. It might do this underneath, but I have never seen a debugger jump back =D

Bye, Skybuck
#7
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Or maybe I just missed the start of it... here is the start of it:
Using Inline Assembly Code (Win32 Only)

The built-in assembler allows you to write assembly code within Delphi programs. The inline assembler is available only on the Win32 Delphi compiler. It has the following features:

Allows for inline assembly.
Supports all instructions found in the Intel Pentium 4, Intel MMX extensions, Streaming SIMD Extensions (SSE), and the AMD Athlon (including 3D Now!).
Provides no macro support, but allows for pure assembly function procedures.
Permits the use of Delphi identifiers, such as constants, types, and variables, in assembly statements.

As an alternative to the built-in assembler, you can link to object files that contain external procedures and functions. See the topic on External declarations for more information. If you have external assembly code that you want to use in your applications, you should consider rewriting it in the Delphi language or minimally reimplementing it using the inline assembler.

Using the asm Statement

The built-in assembler is accessed through asm statements, which have the form

asm statementList end

where statementList is a sequence of assembly statements separated by semicolons, end-of-line characters, or Delphi comments. Comments in an asm statement must be in Delphi style. A semicolon does not indicate that the rest of the line is a comment. The reserved word inline and the directive assembler are maintained for backward compatibility only. They have no effect on the compiler.

"Skybuck Flying" wrote in message news:...

Here is some information for you about Delphi's assembler. I think the calling convention can be "register" or simply "stdcall". This information was copied from a PDF; the copy & paste is a bit messy but might be of some use. If you have Acrobat Reader then you can read the PDF here: http://www.win.tue.nl/~wstomv/edu/de...guageGuide.pdf Scroll down or click in the index: the section is called "inline assembly code (win32 only)".
Understanding Assembler Syntax (Win32 Only)

The inline assembler is available only on the Win32 Delphi compiler. The following material describes the elements of the assembler syntax necessary for proper use: Assembler Statement Syntax, Labels, Instruction Opcodes, Assembly Directives, Operands.

Assembler Statement Syntax

The syntax of an assembly statement is

Label: Prefix Opcode Operand1, Operand2

where Label is a label, Prefix is an assembly prefix opcode (operation code), Opcode is an assembly instruction opcode or directive, and Operand is an assembly expression. Label and Prefix are optional. Some opcodes take only one operand, and some take none. Comments are allowed between assembly statements, but not within them. For example,

MOV AX,1 {Initial value} { OK }
MOV CX,100 {Count} { OK }
MOV {Initial value} AX,1; { Error! }
MOV CX, {Count} 100 { Error! }

Labels

Labels are used in built-in assembly statements as they are in the Delphi language: by writing the label and a colon before a statement. There is no limit to a label's length. As in Delphi, labels must be declared in a label declaration part in the block containing the asm statement. The one exception to this rule is local labels. Local labels are labels that start with an at-sign (@). They consist of an at-sign followed by one or more letters, digits, underscores, or at-signs. Use of local labels is restricted to asm statements, and the scope of a local label extends from the asm reserved word to the end of the asm statement that contains it. A local label doesn't have to be declared.

Instruction Opcodes

The built-in assembler supports all of the Intel-documented opcodes for general application use. Note that operating system privileged instructions may not be supported. Specifically, the following families of instructions are supported:

Pentium family
Pentium Pro and Pentium II
Pentium III
Pentium 4

In addition, the built-in assembler supports the following instruction sets: AMD 3DNow!
(from the AMD K6 onwards) and AMD Enhanced 3DNow! (from the AMD Athlon onwards). For a complete description of each instruction, refer to your microprocessor documentation.

RET instruction sizing

The RET instruction opcode always generates a near return.

Automatic jump sizing

Unless otherwise directed, the built-in assembler optimizes jump instructions by automatically selecting the shortest, and therefore most efficient, form of a jump instruction. This automatic jump sizing applies to the unconditional jump instruction (JMP), and to all conditional jump instructions when the target is a label (not a procedure or function). For an unconditional jump instruction (JMP), the built-in assembler generates a short jump (one-byte opcode followed by a one-byte displacement) if the distance to the target label is -128 to 127 bytes. Otherwise it generates a near jump (one-byte opcode followed by a two-byte displacement). For a conditional jump instruction, a short jump (one-byte opcode followed by a one-byte displacement) is generated if the distance to the target label is -128 to 127 bytes. Otherwise, the built-in assembler generates a short jump with the inverse condition, which jumps over a near jump to the target label (five bytes in total). For example, the assembly statement

JC Stop

where Stop isn't within reach of a short jump, is converted to a machine code sequence that corresponds to this:

JNC Skip
JMP Stop
Skip:

Jumps to the entry points of procedures and functions are always near.

Assembly Directives

The built-in assembler supports three assembly define directives: DB (define byte), DW (define word), and DD (define double word). Each generates data corresponding to the comma-separated operands that follow the directive. The DB directive generates a sequence of bytes. Each operand can be a constant expression with a value between -128 and 255, or a character string of any length.
Constant expressions generate one byte of code, and strings generate a sequence of bytes with values corresponding to the ASCII code of each character. The DW directive generates a sequence of words. Each operand can be a constant expression with a value between -32,768 and 65,535, or an address expression. For an address expression, the built-in assembler generates a near pointer, a word that contains the offset part of the address. The DD directive generates a sequence of double words. Each operand can be a constant expression with a value between -2,147,483,648 and 4,294,967,295, or an address expression. For an address expression, the built-in assembler generates a far pointer, a word that contains the offset part of the address, followed by a word that contains the segment part of the address. The DQ directive defines a quad word for Int64 values.

The data generated by the DB, DW, and DD directives is always stored in the code segment, just like the code generated by other built-in assembly statements. To generate uninitialized or initialized data in the data segment, you should use Delphi var or const declarations. Some examples of DB, DW, and DD directives follow.

asm
  DB 0FFH { One byte }
  DB 0,99 { Two bytes }
  DB 'A' { Ord('A') }
  DB 'Hello world...',0DH,0AH { String followed by CR/LF }
  DB 12,'string' { Delphi style string }
  DW 0FFFFH { One word }
  DW 0,9999 { Two words }
  DW 'A' { Same as DB 'A',0 }
  DW 'BA' { Same as DB 'A','B' }
  DW MyVar { Offset of MyVar }
  DW MyProc { Offset of MyProc }
  DD 0FFFFFFFFH { One double-word }
  DD 0,999999999 { Two double-words }
  DD 'A' { Same as DB 'A',0,0,0 }
  DD 'DCBA' { Same as DB 'A','B','C','D' }
  DD MyVar { Pointer to MyVar }
  DD MyProc { Pointer to MyProc }
end;

When an identifier precedes a DB, DW, or DD directive, it causes the declaration of a byte-, word-, or double-word-sized variable at the location of the directive. For example, the assembler allows the following:

ByteVar DB ?
WordVar DW ?
IntVar DD ?
...
MOV AL,ByteVar
MOV BX,WordVar
MOV ECX,IntVar

The built-in assembler doesn't support such variable declarations. The only kind of symbol that can be defined in an inline assembly statement is a label. All variables must be declared using Delphi syntax; the preceding construction can be replaced by

var
  ByteVar: Byte;
  WordVar: Word;
  IntVar: Integer;
...
asm
  MOV AL,ByteVar
  MOV BX,WordVar
  MOV ECX,IntVar
end;

SMALL and LARGE can be used to determine the width of a displacement:

MOV EAX, [LARGE $1234]

This instruction generates a 'normal' move with a 32-bit displacement ($00001234).

MOV EAX, [SMALL $1234]

The second instruction will generate a move with an address size override prefix and a 16-bit displacement ($1234). SMALL can be used to save space. The following example generates an address size override and a 2-byte address (in total three bytes):

MOV EAX, [SMALL 123]

as opposed to

MOV EAX, [123]

which will generate no address size override and a 4-byte address (in total four bytes).

Two additional directives allow assembly code to access dynamic and virtual methods: VMTOFFSET and DMTINDEX.

VMTOFFSET retrieves the offset in bytes of the virtual method pointer table entry of the virtual method argument from the beginning of the virtual method table (VMT). This directive needs a fully specified class name with a method name as a parameter (for example, TExample.VirtualMethod), or an interface name and an interface method name.

DMTINDEX retrieves the dynamic method table index of the passed dynamic method. This directive also needs a fully specified class name with a method name as a parameter, for example, TExample.DynamicMethod. To invoke the dynamic method, call System.@CallDynaInst with the (E)SI register containing the value obtained from DMTINDEX.

Note: Methods with the message directive are implemented as dynamic methods and can also be called using the DMTINDEX technique.
For example:

TMyClass = class
  procedure x; message MYMESSAGE;
end;

The following example uses both DMTINDEX and VMTOFFSET to access dynamic and virtual methods:

program Project2;

type
  TExample = class
    procedure DynamicMethod; dynamic;
    procedure VirtualMethod; virtual;
  end;

procedure TExample.DynamicMethod;
begin
end;

procedure TExample.VirtualMethod;
begin
end;

procedure CallDynamicMethod(e: TExample);
asm
  // Save ESI register
  PUSH ESI
  // Instance pointer needs to be in EAX
  MOV EAX, e
  // DMT entry index needs to be in (E)SI
  MOV ESI, DMTINDEX TExample.DynamicMethod
  // Now call the method
  CALL System.@CallDynaInst
  // Restore ESI register
  POP ESI
end;

procedure CallVirtualMethod(e: TExample);
asm
  // Instance pointer needs to be in EAX
  MOV EAX, e
  // Retrieve VMT table entry
  MOV EDX, [EAX]
  // Now call the method at offset VMTOFFSET
  CALL DWORD PTR [EDX + VMTOFFSET TExample.VirtualMethod]
end;

var
  e: TExample;
begin
  e := TExample.Create;
  try
    CallDynamicMethod(e);
    CallVirtualMethod(e);
  finally
    e.Free;
  end;
end.

Operands

Inline assembler operands are expressions that consist of constants, registers, symbols, and operators. Within operands, the following reserved words have predefined meanings:

Built-in assembler reserved words

AH CL DX ESP mm4 SHL WORD
AL CS EAX FS mm5 SHR xmm0
AND CX EBP GS mm6 SI xmm1
AX DH EBX HIGH mm7 SMALL xmm2
BH DI ECX LARGE MOD SP xmm3
BL DL EDI LOW NOT SS xmm4
BP CL EDX mm0 OFFSET ST xmm5
BX DMTINDEX EIP mm1 OR TBYTE xmm6
BYTE DS ES mm2 PTR TYPE xmm7
CH DWORD ESI mm3 QWORD VMTOFFSET XOR

Reserved words always take precedence over user-defined identifiers. For example,

var
  Ch: Char;
...
asm
  MOV CH, 1
end;

loads 1 into the CH register, not into the Ch variable. To access a user-defined symbol with the same name as a reserved word, you must use the ampersand (&) override operator:

MOV &Ch, 1

It is best to avoid user-defined identifiers with the same names as built-in assembler reserved words.
Assembly Expressions (Win32 Only)

The built-in assembler evaluates all expressions as 32-bit integer values. It doesn't support floating-point and string values, except string constants. The inline assembler is available only on the Win32 Delphi compiler.

Expressions are built from expression elements and operators, and each expression has an associated expression class and expression type. This topic covers the following material: Differences between Delphi and Assembler Expressions, Expression Elements, Expression Classes, Expression Types, Expression Operators.

Differences between Delphi and Assembler Expressions

The most important difference between Delphi expressions and built-in assembler expressions is that assembler expressions must resolve to a constant value. In other words, an expression must resolve to a value that can be computed at compile time. For example, given the declarations

const
  X = 10;
  Y = 20;
var
  Z: Integer;

the following is a valid statement.

asm
  MOV Z,X+Y
end;

Because both X and Y are constants, the expression X + Y is a convenient way of writing the constant 30, and the resulting instruction simply moves the value 30 into the variable Z. But if X and Y are variables

var
  X, Y: Integer;

the built-in assembler cannot compute the value of X + Y at compile time. In this case, to move the sum of X and Y into Z you would use

asm
  MOV EAX,X
  ADD EAX,Y
  MOV Z,EAX
end;

In a Delphi expression, a variable reference denotes the contents of the variable. But in an assembler expression, a variable reference denotes the address of the variable. In Delphi the expression X + 4 (where X is a variable) means the contents of X plus 4, while to the built-in assembler it means the contents of the word at the address four bytes higher than the address of X. So, even though you are allowed to write

asm
  MOV EAX,X+4
end;

this code doesn't load the value of X plus 4 into EAX; instead, it loads the value of a word stored four bytes beyond X.
The correct way to add 4 to the contents of X is

asm
  MOV EAX,X
  ADD EAX,4
end;

Expression Elements

The elements of an expression are constants, registers, and symbols.

Numeric Constants

Numeric constants must be integers, and their values must be between -2,147,483,648 and 4,294,967,295. By default, numeric constants use decimal notation, but the built-in assembler also supports binary, octal, and hexadecimal. Binary notation is selected by writing a B after the number, octal notation by writing an O after the number, and hexadecimal notation by writing an H after the number or a $ before the number. Numeric constants must start with one of the digits 0 through 9 or the $ character. When you write a hexadecimal constant using the H suffix, an extra zero is required in front of the number if the first significant digit is one of the digits A through F. For example, 0BAD4H and $BAD4 are hexadecimal constants, but BAD4H is an identifier because it starts with a letter.

String Constants

String constants must be enclosed in single or double quotation marks. Two consecutive quotation marks of the same type as the enclosing quotation marks count as only one character. Here are some examples of string constants:

'Z'
'Delphi'
'Linux'
"That's all folks"
'"That''s all folks," he said.'
'100'
'"'
"'"

String constants of any length are allowed in DB directives, and cause allocation of a sequence of bytes containing the ASCII values of the characters in the string. In all other cases, a string constant can be no longer than four characters and denotes a numeric value which can participate in an expression. The numeric value of a string constant is calculated as

Ord(Ch1) + Ord(Ch2) shl 8 + Ord(Ch3) shl 16 + Ord(Ch4) shl 24

where Ch1 is the rightmost (last) character and Ch4 is the leftmost (first) character. If the string is shorter than four characters, the leftmost characters are assumed to be zero. The following table shows string constants and their numeric values.
String examples and their values

String     Value
'a'        00000061H
'ba'       00006261H
'cba'      00636261H
'dcba'     64636261H
'a '       00006120H
'   a'     20202061H
'a' * 2    000000E2H
'a'-'A'    00000020H
not 'a'    FFFFFF9EH

Registers

The following reserved symbols denote CPU registers in the inline assembler:

32-bit general purpose: EAX EBX ECX EDX
32-bit pointer or index: ESP EBP ESI EDI
16-bit general purpose: AX BX CX DX
16-bit pointer or index: SP BP SI DI
8-bit low registers: AL BL CL DL
8-bit high registers: AH BH CH DH
16-bit segment registers: CS DS SS ES
32-bit segment registers: FS GS
Coprocessor register stack: ST

When an operand consists solely of a register name, it is called a register operand. All registers can be used as register operands, and some registers can be used in other contexts. The base registers (BX and BP) and the index registers (SI and DI) can be written within square brackets to indicate indexing. Valid base/index register combinations are [BX], [BP], [SI], [DI], [BX+SI], [BX+DI], [BP+SI], and [BP+DI]. You can also index with all the 32-bit registers, for example, [EAX+ECX], [ESP], and [ESP+EAX+5]. The segment registers (ES, CS, SS, DS, FS, and GS) are supported, but segments are normally not useful in 32-bit applications. The symbol ST denotes the topmost register on the 8087 floating-point register stack. Each of the eight floating-point registers can be referred to using ST(X), where X is a constant between 0 and 7 indicating the distance from the top of the register stack.

Symbols

The built-in assembler allows you to access almost all Delphi identifiers in assembly language expressions, including constants, types, variables, procedures, and functions. In addition, the built-in assembler implements the special symbol @Result, which corresponds to the Result variable within the body of a function.
For example, the function

function Sum(X, Y: Integer): Integer;
begin
  Result := X + Y;
end;

could be written in assembly language as

function Sum(X, Y: Integer): Integer; stdcall;
begin
  asm
    MOV EAX,X
    ADD EAX,Y
    MOV @Result,EAX
  end;
end;

The following symbols cannot be used in asm statements:

Standard procedures and functions (for example, WriteLn and Chr).
String, floating-point, and set constants (except when loading registers).
Labels that aren't declared in the current block.
The @Result symbol outside of functions.

The following table summarizes the kinds of symbol that can be used in asm statements.

Symbols recognized by the built-in assembler

Symbol     Value                                                  Class             Type
Label      Address of label                                       Memory reference  Size of type
Constant   Value of constant                                      Immediate value   0
Type       0                                                      Memory reference  Size of type
Field      Offset of field                                        Memory            Size of type
Variable   Address of variable or address of a pointer to it      Memory reference  Size of type
Procedure  Address of procedure                                   Memory reference  Size of type
Function   Address of function                                    Memory reference  Size of type
Unit       0                                                      Immediate value   0
@Result    Result variable offset                                 Memory reference  Size of type

With optimizations disabled, local variables (variables declared in procedures and functions) are always allocated on the stack and accessed relative to EBP, and the value of a local variable symbol is its signed offset from EBP. The assembler automatically adds [EBP] in references to local variables. For example, given the declaration

var
  Count: Integer;

within a function or procedure, the instruction MOV EAX,Count assembles into MOV EAX,[EBP-4].

The built-in assembler treats var parameters as 32-bit pointers, and the size of a var parameter is always 4. The syntax for accessing a var parameter is different from that for accessing a value parameter. To access the contents of a var parameter, you must first load the 32-bit pointer and then access the location it points to.
For example,

function Sum(var X, Y: Integer): Integer; stdcall;
begin
  asm
    MOV EAX,X
    MOV EAX,[EAX]
    MOV EDX,Y
    ADD EAX,[EDX]
    MOV @Result,EAX
  end;
end;

Identifiers can be qualified within asm statements. For example, given the declarations

type
  TPoint = record
    X, Y: Integer;
  end;
  TRect = record
    A, B: TPoint;
  end;
var
  P: TPoint;
  R: TRect;

the following constructions can be used in an asm statement to access fields.

MOV EAX,P.X
MOV EDX,P.Y
MOV ECX,R.A.X
MOV EBX,R.B.Y

A type identifier can be used to construct variables on the fly. Each of the following instructions generates the same machine code, which loads the contents of [EDX] into EAX.

MOV EAX,(TRect PTR [EDX]).B.X
MOV EAX,TRect([EDX]).B.X
MOV EAX,TRect[EDX].B.X
MOV EAX,[EDX].TRect.B.X

Expression Classes

The built-in assembler divides expressions into three classes: registers, memory references, and immediate values. An expression that consists solely of a register name is a register expression. Examples of register expressions are AX, CL, DI, and ES. Used as operands, register expressions direct the assembler to generate instructions that operate on the CPU registers.

Expressions that denote memory locations are memory references. Delphi's labels, variables, typed constants, procedures, and functions belong to this category.

Expressions that aren't registers and aren't associated with memory locations are immediate values. This group includes Delphi's untyped constants and type identifiers.

Immediate values and memory references cause different code to be generated when used as operands. For example,

const
  Start = 10;
var
  Count: Integer;
...
asm
  MOV EAX,Start        { MOV EAX,xxxx }
  MOV EBX,Count        { MOV EBX,[xxxx] }
  MOV ECX,[Start]      { MOV ECX,[xxxx] }
  MOV EDX,OFFSET Count { MOV EDX,xxxx }
end;

Because Start is an immediate value, the first MOV is assembled into a move immediate instruction. The second MOV, however, is translated into a move memory instruction, as Count is a memory reference.
In the third MOV, the brackets convert Start into a memory reference (in this case, the word at offset 10 in the data segment). In the fourth MOV, the OFFSET operator converts Count into an immediate value (the offset of Count in the data segment). The brackets and OFFSET operator complement each other. The following asm statement produces identical machine code to the first two lines of the previous asm statement.

asm
  MOV EAX,OFFSET [Start]
  MOV EBX,[OFFSET Count]
end;

Memory references and immediate values are further classified as either relocatable or absolute. Relocation is the process by which the linker assigns absolute addresses to symbols. A relocatable expression denotes a value that requires relocation at link time, while an absolute expression denotes a value that requires no such relocation. Typically, expressions that refer to labels, variables, procedures, or functions are relocatable, since the final address of these symbols is unknown at compile time. Expressions that operate solely on constants are absolute. The built-in assembler allows you to carry out any operation on an absolute value, but it restricts operations on relocatable values to addition and subtraction of constants.

Expression Types

Every built-in assembler expression has a type or, more correctly, a size, because the assembler regards the type of an expression simply as the size of its memory location. For example, the type of an Integer variable is four, because it occupies 4 bytes. The built-in assembler performs type checking whenever possible, so in the instructions

var
  QuitFlag: Boolean;
  OutBufPtr: Word;
...
asm
  MOV AL,QuitFlag
  MOV BX,OutBufPtr
end;

the assembler checks that the size of QuitFlag is one (a byte), and that the size of OutBufPtr is two (a word). The instruction MOV DL,OutBufPtr produces an error because DL is a byte-sized register and OutBufPtr is a word.
The type of a memory reference can be changed through a typecast; these are correct ways of writing the previous instruction:

MOV DL,BYTE PTR OutBufPtr
MOV DL,Byte(OutBufPtr)
MOV DL,OutBufPtr.Byte

These MOV instructions all refer to the first (least significant) byte of the OutBufPtr variable. In some cases, a memory reference is untyped. One example is an immediate value (Buffer) enclosed in square brackets:

procedure Example(var Buffer);
asm
  MOV AL, [Buffer]
  MOV CX, [Buffer]
  MOV EDX, [Buffer]
end;

The built-in assembler permits these instructions, because the expression [Buffer] has no type; it just means "the contents of the location indicated by Buffer," and the type can be determined from the first operand (byte for AL, word for CX, and double-word for EDX). In cases where the type can't be determined from another operand, the built-in assembler requires an explicit typecast. For example,

INC BYTE PTR [ECX]
IMUL WORD PTR [EDX]

The following table summarizes the predefined type symbols that the built-in assembler provides in addition to any currently declared Delphi types.

Predefined type symbols

Symbol  Type
BYTE    1
WORD    2
DWORD   4
QWORD   8
TBYTE   10

Expression Operators

The built-in assembler provides a variety of operators. Precedence rules are different from those of the Delphi language; for example, in an asm statement, AND has lower precedence than the addition and subtraction operators. The following table lists the built-in assembler's expression operators in decreasing order of precedence.

Precedence of built-in assembler expression operators

Operators                     Remarks           Precedence
&                                               highest
(...), [...], ., HIGH, LOW
+, -                          unary + and -
:
OFFSET, TYPE, PTR
*, /, MOD, SHL, SHR
+, -                          binary + and -
NOT, AND, OR, XOR                               lowest

The following table defines the built-in assembler's expression operators.

Definitions of built-in assembler expression operators

& — Identifier override.
The identifier immediately following the ampersand is treated as a user-defined symbol, even if the spelling is the same as a built-in assembler reserved symbol.

(...)   Subexpression. Expressions within parentheses are evaluated completely prior to being treated as a single expression element. Another expression can precede the expression within the parentheses; the result in this case is the sum of the values of the two expressions, with the type of the first expression.

[...]   Memory reference. The expression within brackets is evaluated completely prior to being treated as a single expression element. Another expression can precede the expression within the brackets; the result in this case is the sum of the values of the two expressions, with the type of the first expression. The result is always a memory reference.

.       Structure member selector. The result is the sum of the expression before the period and the expression after the period, with the type of the expression after the period. Symbols belonging to the scope identified by the expression before the period can be accessed in the expression after the period.

HIGH    Returns the high-order 8 bits of the word-sized expression following the operator. The expression must be an absolute immediate value.

LOW     Returns the low-order 8 bits of the word-sized expression following the operator. The expression must be an absolute immediate value.

+       Unary plus. Returns the expression following the plus with no changes. The expression must be an absolute immediate value.

-       Unary minus. Returns the negated value of the expression following the minus. The expression must be an absolute immediate value.

+       Addition. The expressions can be immediate values or memory references, but only one of the expressions can be a relocatable value. If one of the expressions is a relocatable value, the result is also a relocatable value. If either of the expressions is a memory reference, the result is also a memory reference.
-       Subtraction. The first expression can have any class, but the second expression must be an absolute immediate value. The result has the same class as the first expression.

:       Segment override. Instructs the assembler that the expression after the colon belongs to the segment given by the segment register name (CS, DS, SS, FS, GS, or ES) before the colon. The result is a memory reference with the value of the expression after the colon. When a segment override is used in an instruction operand, the instruction is prefixed with an appropriate segment-override prefix instruction to ensure that the indicated segment is selected.

OFFSET  Returns the offset part (double word) of the expression following the operator. The result is an immediate value.

TYPE    Returns the type (size in bytes) of the expression following the operator. The type of an immediate value is 0.

PTR     Typecast operator. The result is a memory reference with the value of the expression following the operator and the type of the expression in front of the operator.

*       Multiplication. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

/       Integer division. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

MOD     Remainder after integer division. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

SHL     Logical shift left. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

SHR     Logical shift right. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

NOT     Bitwise negation. The expression must be an absolute immediate value, and the result is an absolute immediate value.

AND     Bitwise AND. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

OR      Bitwise OR.
Both expressions must be absolute immediate values, and the result is an absolute immediate value.

XOR     Bitwise exclusive OR. Both expressions must be absolute immediate values, and the result is an absolute immediate value.

Assembly Procedures and Functions (Win32 Only)

You can write complete procedures and functions using inline assembly language code, without including a begin...end statement. This topic covers these issues: Compiler Optimizations and Function Results. The inline assembler is available only on the Win32 Delphi compiler.

Compiler Optimizations

An example of the type of function you can write is as follows:

function LongMul(X, Y: Integer): Longint;
asm
  MOV EAX,X
  IMUL Y
end;

The compiler performs several optimizations on these routines:

No code is generated to copy value parameters into local variables. This affects all string-type value parameters and other value parameters whose size isn't 1, 2, or 4 bytes. Within the routine, such parameters must be treated as if they were var parameters.

Unless a function returns a string, variant, or interface reference, the compiler doesn't allocate a function result variable; a reference to the @Result symbol is an error. For strings, variants, and interfaces, the caller always allocates an @Result pointer.

The compiler only generates stack frames for nested routines, for routines that have local parameters, or for routines that have parameters on the stack.

Locals is the size of the local variables and Params is the size of the parameters. If both Locals and Params are zero, there is no entry code, and the exit code consists simply of a RET instruction. The automatically generated entry and exit code for the routine looks like this:

PUSH EBP        ;Present if Locals <> 0 or Params <> 0
MOV EBP,ESP     ;Present if Locals <> 0 or Params <> 0
SUB ESP,Locals  ;Present if Locals <> 0
....
MOV ESP,EBP     ;Present if Locals <> 0
POP EBP         ;Present if Locals <> 0 or Params <> 0
RET Params      ;Always present

If locals include variants, long strings, or interfaces, they are initialized to zero but not finalized.

Function Results

Assembly language functions return their results as follows. Ordinal values are returned in AL (8-bit values), AX (16-bit values), or EAX (32-bit values). Real values are returned in ST(0) on the coprocessor's register stack. (Currency values are scaled by 10000.) Pointers, including long strings, are returned in EAX. Short strings and variants are returned in the temporary location pointed to by @Result. "

Bye,
  Skybuck.
#8
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"Skybuck Flying" wrote in message news:... Also in case you have troubles understanding the pascal code then I shall try and write down some simply and easy to understand pseudo code and some explanations/comments, let's see: What the code does and assumes is the following: general description, this information not that important but might give you the general idea: "Memory" is a pointer towards a big linear/sequentially memory block of bytes/integers. The memory block contains many integers. These integers are "virtually" subdivided into seperate blocks. Each block contains 8000 integers. This is indicated by ElementCount, so ElementCount equals 8000. There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals 4000. So in total there are 4000 blocks * 8000 elements * 4 bytes = xxxx bytes of memory. Each block is like a chain of indexes. Each index point towards the next index within it's block. The indexes are thrown around inside the block during initialisation to create a random access pattern. This is also known as the "pointer chasing problem" so it could also be called the "index chasing problem". Index - Index - Index - Index - Index and so forth. The first index must be retrieved from memory, only then does one know where the next index is located and so on. There is an ElementIndex variable which is initialized to zero. So each ElementIndex simply starts at element 0 of block 0. There could be 8000 ElementIndexes in parallel all starting at element 0 of block X. Or there could simply be on ElementIndex and process each block in turn. in version 0.03 "BlockBase" was introduced. BlockBase as an index which points towards the first element of a block. So it's a storage to prevent multiplications all the time. In the code below 3 blocks in parallel are attempted. Now for some further pseudo code: // routine // variables: // each block is processed repeatedly loop count times. 
// loop index indicates the current loop/round
LoopIndex

BlockIndexA    // indicates block a number
BlockIndexB    // indicates block b number
BlockIndexC    // indicates block c number

ElementIndexA  // indicates block a element index
ElementIndexB  // indicates block b element index
ElementIndexC  // indicates block c element index

ElementCount   // number of elements per block
BlockCount     // number of blocks
LoopCount      // number of loops/chases per block

BlockBaseA     // starting index of block a
BlockBaseB     // starting index of block b
BlockBaseC     // starting index of block c

Memory         // contains all integers of all blocks of all elements.
               // So it's an array/block of elements/integers.

RoutineBegin

  ElementCount = 8000
  BlockCount = 4000
  LoopCount = 80000

  BlockIndexA = 0
  BlockIndexB = 1
  BlockIndexC = 2

  // loop over all blocks, process 3 blocks per loop iteration.
  // slight correction:
  // wrong: "BlockIndex goes from 0 to 7999 divided over 3 indexes"
  // somewhat correct: "BlockIndex goes from 0 to 3999 divided over 3 indexes"
  // so BlockIndexA is 0, 3, 6, 9, 12, etc
  // so BlockIndexB is 1, 4, 7, 10, 13, etc
  // so BlockIndexC is 2, 5, 8, 11, 14, etc

  FirstLoopBegin

    // calculate the starting index of each block.
    // formula is: Base/Index/Offset = block number * number of elements per block.
    BlockBaseA = BlockIndexA * vElementCount;
    BlockBaseB = BlockIndexB * vElementCount;
    BlockBaseC = BlockIndexC * vElementCount;

    // initialise each element index to the first index of the block.
    ElementIndexA = 0
    ElementIndexB = 0
    ElementIndexC = 0

    // loop 80000 times through the block arrays/elements
    SecondLoopBegin "LoopIndex goes from 0 to 79999"

      // Seek into memory at location BlockBase + ElementIndex.
      // Do this for A, B and C:
      // retrieve the new element index via the old element index
      ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
      ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
      ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]

    SecondLoopEnd

    // store some bull**** in block result; this step could be skipped,
    // but let's not, so the compiler doesn't optimize anything away.
    // perhaps block result should be printed just in case.
    // the last retrieved index is stored into it.
    // do this for A, B, C
    BlockResult[ BlockIndexA ] = vElementIndexA
    BlockResult[ BlockIndexB ] = vElementIndexB
    BlockResult[ BlockIndexC ] = vElementIndexC

    // once the 3 blocks have been processed go to the next 3 blocks
    BlockIndexA = vBlockIndexA + 3
    BlockIndexB = vBlockIndexB + 3
    BlockIndexC = vBlockIndexC + 3

  FirstLoopEnd

Perhaps this clarifies it a little bit. Let me know if you have any further questions.

Bye,
  Skybuck.
#9
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
Skybuck Flying wrote:
"Skybuck Flying" wrote in message news:... Also in case you have troubles understanding the pascal code then I shall try and write down some simply and easy to understand pseudo code and some explanations/comments, let's see: What the code does and assumes is the following: general description, this information not that important but might give you the general idea: "Memory" is a pointer towards a big linear/sequentially memory block of bytes/integers. The memory block contains many integers. These integers are "virtually" subdivided into seperate blocks. Each block contains 8000 integers. This is indicated by ElementCount, so ElementCount equals 8000. There are 4000 blocks. This is indicated by BlockCount, so BlockCount equals 4000. So in total there are 4000 blocks * 8000 elements * 4 bytes = xxxx bytes of memory. Each block is like a chain of indexes. Each index point towards the next index within it's block. The indexes are thrown around inside the block during initialisation to create a random access pattern. This is also known as the "pointer chasing problem" so it could also be called the "index chasing problem". Index - Index - Index - Index - Index and so forth. The first index must be retrieved from memory, only then does one know where the next index is located and so on. There is an ElementIndex variable which is initialized to zero. So each ElementIndex simply starts at element 0 of block 0. There could be 8000 ElementIndexes in parallel all starting at element 0 of block X. Or there could simply be on ElementIndex and process each block in turn. in version 0.03 "BlockBase" was introduced. BlockBase as an index which points towards the first element of a block. So it's a storage to prevent multiplications all the time. 4000 is 0x0FA0, 8000 is 0x1F40. If we expand these sizes to 4096 and 8192, we get ranges 0x0000...0x0FFF(0x1FFF). This allows simple bit masks to sort out invalid indices and simplifies address calculations markably. 
If we see the entire thing as an array with 4096 pages of 8192 elements, we just have to create a large array with a size of 4096 * 8192 * 4 = 134,217,728 bytes.

In the code below 3 blocks in parallel are attempted.

A very bad idea. L1 cache is 65,536 bytes on recent CPUs, but only 32,768 bytes on future processors, e.g. Zambezi. Three blocks occupy 3 * 32,768 = 98,304 bytes, so 32,768 (or 65,536!) bytes do not fit into L1. The result: memory access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or 27 (RAM) clocks per access, depending on where the cache lines have to be fetched from or written to.

If you want to process multiple blocks in parallel, it's better to check how many physical processors exist on a user's machine, then create that many threads (minus 1 for the OS), each running on one physical processor.

Now for some further pseudo code:

// routine
// variables:
// each block is processed repeatedly, loop count times.
// loop index indicates the current loop/round
LoopIndex
BlockIndexA    // indicates block a number
BlockIndexB    // indicates block b number
BlockIndexC    // indicates block c number
ElementIndexA  // indicates block a element index
ElementIndexB  // indicates block b element index
ElementIndexC  // indicates block c element index
ElementCount   // number of elements per block
BlockCount     // number of blocks
LoopCount      // number of loops/chases per block
BlockBaseA     // starting index of block a
BlockBaseB     // starting index of block b
BlockBaseC     // starting index of block c
Memory         // contains all integers of all blocks of all elements.

RoutineBegin

ElementCount = 8000
BlockCount = 4000
LoopCount = 80000

BlockIndexA = 0
BlockIndexB = 1
BlockIndexC = 2

// loop over all blocks, process 3 blocks per loop iteration.
// slight correction:
// wrong: "BlockIndex goes from 0 to 7999 divided over 3 indexes"
// somewhat correct: "BlockIndex goes from 0 to 3999 divided over 3 indexes"
// so BlockIndexA is 0, 3, 6, 9, 12, etc
// so BlockIndexB is 1, 4, 7, 10, 13, etc
// so BlockIndexC is 2, 5, 8, 11, 14, etc

FirstLoopBegin

// calculate the starting index of each block.
// formula is: Base/Index/Offset = block number * number of elements per block.
BlockBaseA = BlockIndexA * vElementCount;
BlockBaseB = BlockIndexB * vElementCount;
BlockBaseC = BlockIndexC * vElementCount;

With 8,192 elements per block, the multiplication can be replaced by a shift:

BlockBaseN = BlockIndexN << 15

reducing the VectorPath MUL (blocking all pipes for five clocks) to a DirectPath SHL (one clock, one pipe).

// initialise each element index to the first index of the block.
ElementIndexA = 0
ElementIndexB = 0
ElementIndexC = 0

// loop 80000 times through the block arrays/elements
SecondLoopBegin "LoopIndex goes from 0 to 79999"

// Seek into memory at location BlockBase + ElementIndex
// do this for A, B and C:
// retrieve the new element index via the old element index
ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ]
ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ]
ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ]

SecondLoopEnd

I still do not understand what this is good for, but you should consider processing one block after the other. If you are eager to introduce something like "parallel processing", divide your entire array into as many parts as there are processors (hopefully a multiple of two), then let those threads process a contiguous part of memory in parallel.

If you prefer a single threaded solution, it's better to process one block after the other, because entire blocks fit into L1 completely.
Any "touched" element is present in L1 after the first access, three registers are enough to manage the entire loop, faster algorithms can be used to process multiple elements in parallel (all pipes busy most of the time), and so on, // store some bull**** in block result, this step could be skipped. // but let's not so compiler doesn't optimize anything away // perhaps block result should be printed just in case. // the last retrieved index is store into it. // do this for A,B,C BlockResult[ BlockIndexA ] = vElementIndexA BlockResult[ BlockIndexB ] = vElementIndexB BlockResult[ BlockIndexC ] = vElementIndexC // once the 3 blocks have been processed go to the next 3 blocks BlockIndexA = vBlockIndexA + 3 BlockIndexB = vBlockIndexB + 3 BlockIndexC = vBlockIndexC + 3 FirstLoopEnd Perhaps this clearifies it a little bit. Let me know if you have any further questions As shown, your current approach is quite slow - it might be worth a thought to change your current design towards something making use of built-in acceleration mechanisms like caching and multiple execution pipelines. Using multiples of two (rather than ten) simplifies cal- culations markably. Hence, the first thing to change was the size of your array - 4,096 blocks and 8,192 elements don't occupy that much additional memory, 6,217,728 byte are peanuts compared to the entire array. In return, you save a lot of time wasted with calculating addresses and validating indices (AND automatically keeps your indices within the valid range). The slightly changed design used addressing modes like mov REG,block_offset[block_base + index * 4] or mov block_offset[block_base + index * 4],REG allowing to step through the entire array with some very simple (and fast) additions and shifts. For example, the given function required one register for block_base plus block_offset. This had to be updated only once per outer loop, while the inner loop just fetches indexed elements and writes them back. 
Reducing address calculations to a single register leaves 12 registers for the real work (two registers are required as loop counters).

Greetings from Augsburg

Bernhard Schornak
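Bernhard's power-of-two layout can be sketched in C. The constant and function names here are invented for illustration; the shift counts follow his numbers (8,192 = 2^13 elements of 4 bytes each, so a block is 2^15 bytes):

```c
/* Power-of-two block layout: the MUL in
   BlockBase = BlockIndex * ElementCount becomes a shift, and an AND
   with (ElementCount - 1) keeps any index inside the valid range
   without a compare-and-branch. */
enum {
    ElementShift = 13,
    ElementCount = 1 << ElementShift,   /* 8192 elements            */
    ElementMask  = ElementCount - 1,    /* 0x1FFF                   */
    ByteShift    = ElementShift + 2     /* 4-byte elements: * 4     */
};

/* Element index of the first element of a block. */
static inline unsigned block_base(unsigned block_index)
{
    return block_index << ElementShift;
}

/* Byte offset of a block, as used in "mov REG,off[base + index * 4]". */
static inline unsigned block_byte_base(unsigned block_index)
{
    return block_index << ByteShift;
}

/* Mask an arbitrary value into the range 0 .. ElementCount - 1. */
static inline unsigned clamp_index(unsigned ix)
{
    return ix & ElementMask;
}
```

The mask is why invalid indices "sort themselves out": any out-of-range value wraps back into the block instead of requiring a range check.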
#10
An idea how to speed up computer programs and avoid waiting. ("event driven memory system")
"
4000 is 0x0FA0, 8000 is 0x1F40."

Ok, this could be handy to know, to spot these constants/parameters potentially in your code.

"If we expand these sizes to 4096 and 8192, we get ranges 0x0000...0x0FFF (0x1FFF). This allows simple bit masks to sort out invalid indices and simplifies address calculations markedly."

Why would you want to spot invalid indices? At least my Pascal code seems already ok. Perhaps this is to help you debug your assembler code? Address calculations should be possible for any setting; this allows flexibility in testing different scenarios.

"If we see the entire thing as an array with 4096 pages of 8192 elements, we just have to create a large array with a size of 4096 * 8192 * 4 = 134,217,728 bytes."

Irrelevant, any size can be allocated easily, at least in a high level language; perhaps this helps you in assembler?

In the code below 3 blocks in parallel are attempted.

"A very bad idea."

This was your idea after all. It might be bad for the current settings, but maybe there is some value in it for other settings.

"L1 cache is 65,536 bytes on recent CPUs, but only 32,768 bytes on future processors, e.g. Zambezi. Three blocks occupy 3 * 32,768 = 98,304 bytes, so 32,768 (or 65,536!) bytes do not fit into L1. The result: memory access is slowed down from 3 (L1) to 6 (L2), 12 (L3) or 27 (RAM) clocks per access, depending on where the cache lines have to be fetched from or written to. If you want to process multiple blocks in parallel, it's better to check how many physical processors exist on a user's machine, then create that many threads (minus 1 for the OS), each running on one physical processor."

So far the examples have all executed well inside the cache, because of the low settings which almost fit in L1 cache. However it would also be interesting to change the settings to, for example, a block size of 1920x1200 elements to mimic a single frame of a video codec, which might want to do some random access memory.
So then the L1 cache will probably be woefully short, and then you can still try and see if your "pairing/parallel" processing of frames might give higher throughput.

However I already do see a little problem with this idea. Usually video codec frames are processed sequentially, so it might be difficult to impossible to process multiple at the same time. However a frame does consist of r, g, b and sometimes maybe even an alpha channel. So there could be 3 to 4 channels, so your idea of processing 3 blocks at the same time might still have some merit, when each color channel is considered a block.

Another interesting setting could be to reduce the settings to make it fit entirely into the L1 cache, to see if your pairing/multiple loading code also becomes faster when it does fit in cache.

Writing multi threading code for "random access memory" could become tricky, especially if writes are involved as well. Processors might conflict with each other. As long as processors operate within their own blocks and can never access each other's blocks, then it should be ok. Some care should still be taken at the border of the blocks; inserting some padding there could help prevent accidental/unlucky simultaneous access.

So it does become a little bit more tricky to write code like that... because of the padding, allocations have to be done a little bit differently, as well as the formulas. I can see it becoming a bit messy, though some variable names could help solve the mess, as follows:

Allocate( mElementCount + mPaddingCount );

^ Then the padding count could be used in formulas/code where necessary, or left out where not necessary. Like mod mElementCount to set up indexes correctly.

"With 8,192 elements per block, the multiplication can be replaced by a shift:

BlockBaseN = BlockIndexN << 15

reducing the VectorPath MUL (blocking all pipes for five clocks) to a DirectPath SHL (one clock, one pipe).
" While to explanation of this optimization idea/trick is interesting unfortunately it cannot be really used in the example code, The example code needs to remain flexible to try out different scenerio's to see if your idea can be usefull for something // initialise each element index to the first index of the block. ElementIndexA = 0 ElementIndexB = 0 ElementIndexC = 0 // loop 80000 times through the block arrays/elements SecondLoopBegin "LoopIndex goes from 0 to 79999" // Seek into memory at location BlockBase + ElementIndex // do this for A, B and C: // retrieve the new element index via the old element index ElementIndexA = mMemory[ vBlockBaseA + vElementIndexA ] ElementIndexB = mMemory[ vBlockBaseB + vElementIndexB ] ElementIndexC = mMemory[ vBlockBaseC + vElementIndexC ] SecondLoopEnd " I still do not understand what this is good for, but you should consider to process one block after the other. If you are eager to introduce something like "parallel pro- cessing", divide your entire array into as much parts as there are processors (hopefully a multiple of two), then let those threads process a contiguous part of memory in parallel. " The code above represents your "parallel block" idea, where there would be 3 loads in parallel. The mMemory[ ... ] is a load. See one of your earlier postings of yourself where you write about that idea. You "claimed" the X2 processor can do 3 loads in parallel more or less. The code above tries to take adventage of that. " If you prefer a single threaded solution, it's better to process one block after the other, because entire blocks fit into L1 completely. " Perhaps your parallel loading idea might also be faster in such a situation. If not then the opposite situation of not fitting into L1 could be tested as well to see if your parallel loading idea gives more performance. 
" Any "touched" element is present in L1 after the first access, three registers are enough to manage the entire loop, faster algorithms can be used to process multiple elements in parallel (all pipes busy most of the time), and so on, " ? // store some bull**** in block result, this step could be skipped. // but let's not so compiler doesn't optimize anything away // perhaps block result should be printed just in case. // the last retrieved index is store into it. // do this for A,B,C BlockResult[ BlockIndexA ] = vElementIndexA BlockResult[ BlockIndexB ] = vElementIndexB BlockResult[ BlockIndexC ] = vElementIndexC // once the 3 blocks have been processed go to the next 3 blocks BlockIndexA = vBlockIndexA + 3 BlockIndexB = vBlockIndexB + 3 BlockIndexC = vBlockIndexC + 3 FirstLoopEnd Perhaps this clearifies it a little bit. Let me know if you have any further questions " As shown, your current approach is quite slow " Which one ? The single block version (which should be quite fast) or the three block version (which was your idea ! ). " - it might be worth a thought to change your current design towards something making use of built-in acceleration mechanisms like caching and multiple execution pipelines. " It already does caching by accident more or less, multiple execution pipelines that was tried with your idea. I think it does do both a bit... I don't know exactly how much it already does these things. A processor simulator would be necessary to see what exactly is happening. But I can tell a little bit from re-arranging the code that it does some of these things that you mentioned. So perhaps the "three block" version is already optimal but it's still slower than the "one block" version, at least for 32 bit because then it fits in L1 cache, however different settings should still be tried to see which one if faster then. Also 64 bit assembly code of yours should still be tried, it's not tried yet. 
" Using multiples of two (rather than ten) simplifies cal- culations markably. Hence, the first thing to change was the size of your array - 4,096 blocks and 8,192 elements don't occupy that much additional memory, 6,217,728 byte are peanuts compared to the entire array. In return, you save a lot of time wasted with calculating addresses and validating indices (AND automatically keeps your indices within the valid range). " As long as loading and writing is not affected by the extra padding bytes, and as long as the example executes correctly then arrays could be padding with extra bytes if it makes address calculations faster without changing the nature of the example. However if the example suddenly produces different results then it would be a problem. " The slightly changed design used addressing modes like mov REG,block_offset[block_base + index * 4] or mov block_offset[block_base + index * 4],REG " What's the difference here ? One seems reading the other seems writing ? So apperently you comparing this code to something else ? But what ? The shifting code ? " allowing to step through the entire array with some very simple (and fast) additions and shifts. For example, the given function required one register for block_base plus block_offset. This had to be updated only once per outer loop, while the inner loop just fetches indexed elements and writes them back. Reducing address calculations to a single register, 12 registers are left for the real work (two registers are required as loop counters). " Ok, I think I now grasp your concept of using the shifting tricks and why it would lead to higher throughput. Now that I look at your shifting code again it seems to be applied to the block number. So at least the blocks could be padded for the elements. So the number of actualy elements would be a bit higher, some elements would be unused. The example should remain with 4000 blocks and 8000 elements "virtually" but in reality there could be more... 
however the "more" should not be processed and only be there to speed up the addressing. However it remains to be seen what the effect of this padding could be, for single block processing it doesn't seem to be relevant and caching would be uneffected, the padded elements would not be cached except for the very end if hit. For the multiple blocks idea this could waste a little bit of caching space but might not be to bad. So there is some merit in your idea It could be interesting to write some general code to try and always apply this padding trick and the shifting trick Changes are high I will try to write this soon and see what happens, I expect it run a bit faster after reading your comments and explanation of it Bye, Skybuck =D |