A computer components & hardware forum. HardwareBanter

If this is your first visit, be sure to check out the FAQ by clicking the link above. You may have to register before you can post: click the register link above to proceed. To start viewing messages, select the forum that you want to visit from the selection below.

Go Back   Home » HardwareBanter forum » Processors » Intel
Site Map Home Register Authors List Search Today's Posts Mark Forums Read Web Partners

Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?



 
 
Thread Tools Display Modes
  #1  
Old January 11th 12, 01:21 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Rod Pemberton[_3_]
external usenet poster
 
Posts: 1
Default Can someone explain step by step how one avoid many conditional in forth as described in Moore Fourth essay?

"Arnold Doray" wrote in message
...
....

[Forth code]


CPU pipelining is improved by reducing conditionals. Modern CPUs have
branch prediction, but these aren't always successful, in which case the
pipline needs to be flushed, lowering the CPU's throughput.


Is that still true for multiple cores?

I.e., I would think the following is entirely possible and plausible, but I
haven't studied a CPU design in decades. The CPU's designers could execute
the process in parallel with one core taking one direction for the branch
and the another core taking the other branch direction. Once the correct
branch is determined, the bad execution path is discarded. If the primary
core had the good execution path, it just continues execution. If the
alternate core had the good execution path, it's internal state could be
"pushed" to the primary core. If they used static ram for the internal
state, then it could be "pushed" asynchronously, i.e., between clocks or
sub-clocks. It would require reserving a core for the branch path
execution, at least temporarily.


Rod Pemberton



  #2  
Old January 11th 12, 01:29 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Alex McDonald[_2_]
external usenet poster
 
Posts: 1
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

On Jan 11, 12:21*pm, "Rod Pemberton"
wrote:
"Arnold Doray" wrote in message

...
...

[Forth code]


CPU pipelining is improved by reducing conditionals. Modern CPUs have
branch prediction, but these aren't always successful, in which case the
pipline needs to be flushed, lowering the CPU's throughput.


Is that still true for multiple cores?

I.e., I would think the following is entirely possible and plausible, but I
haven't studied a CPU design in decades. *The CPU's designers could execute
the process in parallel with one core taking one direction for the branch
and the another core taking the other branch direction. *Once the correct
branch is determined, the bad execution path is discarded. *If the primary
core had the good execution path, it just continues execution. *If the
alternate core had the good execution path, it's internal state could be
"pushed" to the primary core. *If they used static ram for the internal
state, then it could be "pushed" asynchronously, i.e., between clocks or
sub-clocks. *It would require reserving a core for the branch path
execution, at least temporarily.

Rod Pemberton


Search on "speculative execution". It can also be done at the compiler
level; that doesn't require processor support. Intel's P6 was their
first chip to support it iirc.
  #3  
Old January 11th 12, 01:48 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
[email protected]
external usenet poster
 
Posts: 17
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

In article ,
Alex McDonald wrote:

Search on "speculative execution". It can also be done at the compiler
level; that doesn't require processor support. Intel's P6 was their
first chip to support it iirc.


But don't believe any page that says that it is a solution to the
problem of memory latency (on cache misses). It isn't, and won't
ever be, for any 'normal' language and application. That's a MUCH
harder task :-(


Regards,
Nick Maclaren.
  #4  
Old January 11th 12, 02:33 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Terje Mathisen[_3_]
external usenet poster
 
Posts: 23
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

Rod Pemberton wrote:
"Arnold wrote in message
...
....

[Forth code]


CPU pipelining is improved by reducing conditionals. Modern CPUs have
branch prediction, but these aren't always successful, in which case the
pipline needs to be flushed, lowering the CPU's throughput.


Is that still true for multiple cores?

I.e., I would think the following is entirely possible and plausible, but I
haven't studied a CPU design in decades. The CPU's designers could execute
the process in parallel with one core taking one direction for the branch
and the another core taking the other branch direction. Once the correct
branch is determined, the bad execution path is discarded. If the primary
core had the good execution path, it just continues execution. If the
alternate core had the good execution path, it's internal state could be
"pushed" to the primary core. If they used static ram for the internal
state, then it could be "pushed" asynchronously, i.e., between clocks or
sub-clocks. It would require reserving a core for the branch path
execution, at least temporarily.


That's "Eager Execution" (EE), something which good branch prediction
makes even less of a win: You can never go through more than 2-3
branches like this before the power loss becomes totally horrendous
(running 4-8 threads in parallel, 75-88% wasted resources), while lots
of current code can branch predict past 6-8 branches before the
probability of a branch miss passes 50%, and all those branches could
actually be inside the pipeline at once.

EE should only be considered as a last resource for branches where the
runtime branch history shows very low predictability, i.e. closer to 50%
branch misses.

BTW, back in March 1995 I visited Intel in Santa Clara, in order to meet
some people from the P6 team and to get an engineering sample PentiumPro
machine.

I had to sign an NDA up front, so before doing that I insisted that I'd
be allowed to write down on a whiteboard everything I then knew or had
been able to guess about the architecture, based on nothing but public
sources. I.e. this was to properly delineate what the NDA could cover.

It turned out that I was correct in every particular, except for one
thing: I expected EE because I didn't realize how much they had been
able to improve the branch predictor.

The engineers on the other side of the table were very happy, while some
of the management and legal people seemed quite worried. :-)

Terje

--
- Terje.Mathisen at tmsw.no
"almost all programming can be viewed as an exercise in caching"
  #5  
Old January 11th 12, 04:03 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Arnold Doray
external usenet poster
 
Posts: 1
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

On Wed, 11 Jan 2012 07:21:53 -0500, Rod Pemberton wrote:

"Arnold Doray" wrote in message
...
...

[Forth code]


CPU pipelining is improved by reducing conditionals. Modern CPUs have
branch prediction, but these aren't always successful, in which case
the pipline needs to be flushed, lowering the CPU's throughput.


Is that still true for multiple cores?

I.e., I would think the following is entirely possible and plausible,
but I haven't studied a CPU design in decades. The CPU's designers
could execute the process in parallel with one core taking one direction
for the branch and the another core taking the other branch direction.
Once the correct branch is determined, the bad execution path is
discarded. If the primary core had the good execution path, it just
continues execution. If the alternate core had the good execution path,
it's internal state could be "pushed" to the primary core. If they used
static ram for the internal state, then it could be "pushed"
asynchronously, i.e., between clocks or sub-clocks. It would require
reserving a core for the branch path execution, at least temporarily.


Rod Pemberton


I don't think this is how it's done on modern CPUs. They use branch
prediction, which can be as good as 70% - 80% right,if I recall
correctly. When the wrong branch is taken, the entire pipeline needs to
be flushed. That's the reason to avoid conditionals.

But let's say for argument's sake that you're right. There are a few
problems with your scheme:

1) It's slow: because you're tying up another core. So you effectively
have one core less than previously.

2) Also, the switching overhead may be large if it has to be done often.

3) What happens with code containing multiple branches? You can't follow
them all. What to do?

4) Let's suppose you have lots of cores so that (3) is not a big issue.
Still, the number of instructions before the next branching could be less
than your pipeline's depth. So you're wasting the advantage offered by a
deep pipeline.

Thinking about Gavino's questio again, it occurs to me that with
predictive branching, a long loop will almost never flush the pipeline
(because the prediction is done by keeping statistics of the historical
branch choices). So this reason alone (pipeline flushing) is not a good
reason avoid using loops. Perhaps CM's comments were made when the
prediction was not as sophisticated as it is now.

Of course, avoiding conditionals entirely will give a faster solution.
How much faster is the question. Anyone has comments/experience with this?

Cheers,
Arnold
  #6  
Old January 11th 12, 06:25 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Andy (Super) Glew
external usenet poster
 
Posts: 6
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

On 1/11/2012 4:29 AM, Alex McDonald wrote:
On Jan 11, 12:21 pm, "Rod
wrote:
"Arnold wrote in message

...
...

[Forth code]


CPU pipelining is improved by reducing conditionals. Modern CPUs have
branch prediction, but these aren't always successful, in which case the
pipline needs to be flushed, lowering the CPU's throughput.


Is that still true for multiple cores?

I.e., I would think the following is entirely possible and plausible, but I
haven't studied a CPU design in decades. The CPU's designers could execute
the process in parallel with one core taking one direction for the branch
and the another core taking the other branch direction. Once the correct
branch is determined, the bad execution path is discarded. If the primary
core had the good execution path, it just continues execution. If the
alternate core had the good execution path, it's internal state could be
"pushed" to the primary core. If they used static ram for the internal
state, then it could be "pushed" asynchronously, i.e., between clocks or
sub-clocks. It would require reserving a core for the branch path
execution, at least temporarily.

Rod Pemberton


Search on "speculative execution". It can also be done at the compiler
level; that doesn't require processor support. Intel's P6 was their
first chip to support it iirc.


Taking both paths on a branch is currently called "eager execution".

Branch prediction is one form of "speculative execution".

P6 did speculative, but not eager. As, for that matter, did P5 (Pentium).

I don't know of anyone doing full eager execution, although it has been
studied out the bejeezus. Branch prediction usually beats it. I
believe an IBM chip did eager ifetch - fetching both sides of a branch -
but did not actually execute, stalled at decoder or therabouts.
  #7  
Old January 13th 12, 06:47 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Joe keane
external usenet poster
 
Posts: 3
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

In article ,
Andy (Super) Glew wrote:
I believe an IBM chip did eager ifetch - fetching both sides of a
branch - but did not actually execute, stalled at decoder or
therabouts.


Doing some 'early' loads seems like a good idea, assuming that you have
some IF units free.

But that is a *lot* different from the other suggestion, copying the
machine state to a different core.
  #8  
Old January 13th 12, 07:09 PM posted to comp.lang.forth,comp.sys.intel,comp.arch
Andy (Super) Glew
external usenet poster
 
Posts: 6
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

On 1/13/2012 9:47 AM, Joe keane wrote:
In ,
Andy (Super) wrote:
I believe an IBM chip did eager ifetch - fetching both sides of a
branch - but did not actually execute, stalled at decoder or
therabouts.


Doing some 'early' loads seems like a good idea, assuming that you have
some IF units free.

But that is a *lot* different from the other suggestion, copying the
machine state to a different core.


Microarchitectures proposed for eager execution:

* using the same pipeline(s) as ordinary non-eager
** e.g. if stalled on the main path, execute on the alternative path
** DEE: e.g. if have gone too far down one path, execute on an
alternative path that has higher relative probability of success than
predicting yet another branch on the current path.

Actually, the DEE technique applies to all forms of eager execution.

* using eager threads on the same CPU core
** although not necessarily the same pipeline(s) - eager threading was
one of the techniques I wanted to investigate on top of my multicluster
multithreading (MCMT) substrate -- although in my experience eager
nearly always loses out to non-eager forms of speculation.

Note that this is not necessarily the same pipeline resources, since in
my terminology an MCMT machine like Bulldozer is really a single CPU
with multiple execution clusters.

* using eager threads of execution on different CPU cores.

Problem is that eager threads seldom last long, so it is hard for eager
threads to overcome the cost of migrating to another core.
  #9  
Old January 14th 12, 05:16 AM posted to comp.lang.forth,comp.sys.intel,comp.arch
Joe keane
external usenet poster
 
Posts: 3
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

In article ,
Andy (Super) Glew wrote:
Problem is that eager threads seldom last long, so it is hard for eager
threads to overcome the cost of migrating to another core.


Essentially creating a new thread, for every -branch-, falls into the
'that don't sound right'.
  #10  
Old January 14th 12, 10:23 AM posted to comp.lang.forth,comp.sys.intel,comp.arch
Terje Mathisen[_3_]
external usenet poster
 
Posts: 23
Default Can someone explain step by step how one avoid many conditionalin forth as described in Moore Fourth essay?

Joe keane wrote:
In ,
Andy (Super) wrote:
Problem is that eager threads seldom last long, so it is hard for eager
threads to overcome the cost of migrating to another core.


Essentially creating a new thread, for every -branch-, falls into the
'that don't sound right'.


I've thought about this in the context "execute alternating instructions
from both sides of the branch, adding implicit predicates to them", so
that when the branch retires the predicates become known and half the
instructions are cancelled.

The problem is of course that you need to fetch from two paths at the
same time, i.e. you need a pair of virtualized IP registers as well.

The first criterium must be that you only ever do this for code that has
been executed a number of times, and where the branch history has turned
out to be very unpredictable.

Another idea could be to start with the predicted branch, then switch to
the other on the first stall/cache miss or new branch.

Terje
--
- Terje.Mathisen at tmsw.no
"almost all programming can be viewed as an exercise in caching"
 




Thread Tools
Display Modes

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

vB code is On
Smilies are On
[IMG] code is On
HTML code is Off
Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Upgrade Report [Step-By-Step: Tune Up Your Windows XP PC - 05/25/2004] Ablang General 0 June 30th 04 03:08 AM
Anyone give me a step by step to setting up raid 0? Mike Barnard Gigabyte Motherboards 6 April 12th 04 10:06 AM
Newbie question - online step by step??? Moldy Overclocking AMD Processors 0 December 8th 03 11:56 AM
INCREDIBLE ! I can't tune the FSB step by step on an A7N8X-X !!! ASUS a liar ? GreatShot Asus Motherboards 3 August 2nd 03 07:25 PM


All times are GMT +1. The time now is 12:08 PM.


Powered by vBulletin® Version 3.6.4
Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 HardwareBanter.
The comments are property of their posters.