A computer components & hardware forum. HardwareBanter

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

Go Back   Home » HardwareBanter forum » Processors » Intel
Site Map Home Register Authors List Search Today's Posts Mark Forums Read Web Partners

A question about optimization



 
 
Thread Tools Display Modes
  #1  
Old June 30th 03, 05:48 PM
David C.
external usenet poster
 
Posts: n/a
Default A question about optimization

I've got a small fragment of code I need to optimize. It adds an
integer to another integer, modulo a third integer.

In C, this operation is:

int i, j, k, l;

i = (j + k) % l;

I know that in the past integer division was a very expensive operation,
so the following fragment might run faster:

i = j + k;
if (i = l)
i -= l;

This avoids the divide operation, but it introduces a compare and branch
operation, which may be worse.

This code will be used on three different chips: Pentium/MMX, Pentium-II
or Pentium-III.

Using GCC with optimizations, the first example compiles to:

i = (j + k) % l;

lea (%esi,%ecx,1),%edi
mov %edi,%eax
cltd
idivl 0xc(%ebp)

The version with the if statement compiles to:

i = j + k;
if (i = l)
i -= l;

lea (%esi,%ecx,1),%edx
cmp %ebx,%ecx
jl 0000001a foo+0x1a
sub %ebx,%ecx

Off the top of my head, I would assume that four instructions with a
divide will be slower than four instructions with a branch, but I don't
know enough about the x86 architecture on the chips involved (Pentium,
P-II and P-III) to know for sure.

Does anyone here know?

-- David
  #2  
Old July 1st 03, 01:18 PM
Eugene Tackleberry
external usenet poster
 
Posts: n/a
Default

On Mon, 30 Jun 2003 12:48:56 -0400, David C. wrote:

Does anyone here know?


Time it using the RDTSC instruction.
  #3  
Old July 8th 03, 12:31 AM
David Schwartz
external usenet poster
 
Posts: n/a
Default


"David C." wrote in message
...

In C, this operation is:

int i, j, k, l;

i = (j + k) % l;

I know that in the past integer division was a very expensive operation,
so the following fragment might run faster:

i = j + k;
if (i = l)
i -= l;

This avoids the divide operation, but it introduces a compare and branch
operation, which may be worse.


Try

i=j+k;
i-=(i=l) ? l : 0;

See if your compiler is smart enough to optimize it into branchless code
(using the SETLE opcode).

DS


  #4  
Old July 8th 03, 12:40 AM
David Schwartz
external usenet poster
 
Posts: n/a
Default


"David Schwartz" wrote in message
...

See if your compiler is smart enough to optimize it into branchless

code
(using the SETLE opcode).


GCC version 3.3 makes it branchless with CMOVG:

addl %ecx, %eax
movl %eax, %ecx
subl %edx, %ecx
cmpl %edx, %eax
cmovg %ecx, %eax

DS


  #5  
Old July 8th 03, 01:53 AM
David Schwartz
external usenet poster
 
Posts: n/a
Default


"David Schwartz" wrote in message
...

GCC version 3.3 makes it branchless with CMOVG:

addl %ecx, %eax
movl %eax, %ecx
subl %edx, %ecx
cmpl %edx, %eax
cmovg %ecx, %eax


Sorry, these are P2 and later optimizations. If you need to support
earlier processors, you can't use them (or need to implement two ways and
detect the CPU at run time).

DS


  #6  
Old July 8th 03, 08:26 PM
David Schwartz
external usenet poster
 
Posts: n/a
Default


It could also be done with no branches using SETLE, but I doubt that'd
be better.

DS


  #7  
Old July 9th 03, 11:04 AM
David Schwartz
external usenet poster
 
Posts: n/a
Default


"Stephen Lee -- post replies please" wrote in message
...
According to David C. :

int i, j, k, l;

i = (j + k) % l;


if (i = l)
i -= l;


Hmm... Don't you mean "if (i l) i -= l;" ?


Close, he means:

if (i=l) i-=l;

The code can be made branchless and the division reduced to a
multiplication by using 'SETGE' (which is available on a Pentium), but it's
not clear that this is any faster. Basically, you compare 'i' to 'l', use
SETGE to set a register to '1' if and only if 'i=l'. Then multiply that
register by 'l' and (giving you zero if 'il' and 'l' if 'i=l' and then
subtract that from 'i'.

This trades a divide for a multiply, a SETGE, and two subtractions (one
to compare and one at the end). I doubt that would be a win overall.

DS


  #8  
Old July 9th 03, 03:02 PM
John Reiser
external usenet poster
 
Posts: n/a
Default

David Schwartz wrote:
"Stephen Lee -- post replies please" wrote in message
...

According to David C. :

int i, j, k, l;

i = (j + k) % l;



if (i = l)
i -= l;



Hmm... Don't you mean "if (i l) i -= l;" ?



Close, he means:

if (i=l) i-=l;

The code can be made branchless and the division reduced to a
multiplication by using 'SETGE' (which is available on a Pentium), but it's
not clear that this is any faster. Basically, you compare 'i' to 'l', use
SETGE to set a register to '1' if and only if 'i=l'. Then multiply that
register by 'l' and (giving you zero if 'il' and 'l' if 'i=l' and then
subtract that from 'i'.

This trades a divide for a multiply, a SETGE, and two subtractions (one
to compare and one at the end). I doubt that would be a win overall.


SETxx writes only 8 bits, so the upper 24 bits must be written some other way.
You can set the entire register to zero before SETxx, but then using the
32-bit register after SETxx incurs a stall (about 7 to 10 cycles) while
the write merge occurs in the register file. So use MOVZBL after SETxx.
After that, replace the multiply with NEG followed by AND, which saves
about 9 cycles.

As pointed out in another post, using "SBC r,r" after unsigned compare
is the fastest and smallest way, and also works on any x86.

  #9  
Old July 9th 03, 09:29 PM
David Schwartz
external usenet poster
 
Posts: n/a
Default


"John Reiser" wrote in message
...

SETxx writes only 8 bits, so the upper 24 bits must be written some other

way.
You can set the entire register to zero before SETxx, but then using the
32-bit register after SETxx incurs a stall (about 7 to 10 cycles) while
the write merge occurs in the register file.


You can avoid this stall by properly zeroing the register early in the
instruction stream, but this would mean yet another instruction.

So use MOVZBL after SETxx.
After that, replace the multiply with NEG followed by AND, which saves
about 9 cycles.


That's a pretty good idea.

As pointed out in another post, using "SBC r,r" after unsigned compare
is the fastest and smallest way, and also works on any x86.


That's the best idea, assuming you're not willing to use P2-specific
code with CMOV.

DS



 




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
A question about dual monitors Woodsy General 1 March 11th 05 06:17 AM
couple of Dimension XPS Gen4 question Matt Dell Computers 3 March 4th 05 02:20 AM
Pentium question Robert Sligh Overclocking AMD Processors 4 April 27th 04 01:13 PM
Question - Printing 2-up on 11x17 Don Printers 1 August 24th 03 04:27 AM
Home PC DVD burner Question. Ruben General Hardware 2 August 18th 03 05:29 PM


All times are GMT +1. The time now is 08:54 AM.


Powered by vBulletin® Version 3.6.4
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 HardwareBanter.
The comments are property of their posters.