#1

Can someone explain step by step how one avoids many conditionals in Forth, as described in Moore's Forth essay?

"Arnold Doray" wrote in message ...

> [Forth code] CPU pipelining is improved by reducing conditionals. Modern
> CPUs have branch prediction, but these aren't always successful, in which
> case the pipeline needs to be flushed, lowering the CPU's throughput.

Is that still true for multiple cores? I.e., I would think the following is entirely possible and plausible, but I haven't studied a CPU design in decades. The CPU's designers could execute the process in parallel, with one core taking one direction of the branch and another core taking the other direction. Once the correct branch is determined, the bad execution path is discarded. If the primary core had the good execution path, it just continues execution. If the alternate core had the good execution path, its internal state could be "pushed" to the primary core. If they used static RAM for the internal state, then it could be "pushed" asynchronously, i.e., between clocks or sub-clocks. It would require reserving a core for the branch-path execution, at least temporarily.

Rod Pemberton
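Before the hardware discussion, it helps to see what "avoiding conditionals" looks like at the source level: replace a data-dependent branch with arithmetic, so there is nothing for the predictor to miss. A minimal Python sketch of the idiom (the function names are mine; in the original context this would be Forth or assembly):

```python
# Branchless selection: result = a if cond else b, written without a branch.
# cond must be exactly 0 or 1; the multiplies replace the conditional jump.
def select(cond, a, b):
    return cond * a + (1 - cond) * b

# Branchless max built from the same idea: the comparison yields 0 or 1,
# which feeds the select instead of steering a jump.
def branchless_max(x, y):
    gt = int(x > y)
    return select(gt, x, y)
```

A compiler targeting x86 would express the same thing with CMOV; the point is that the data dependency replaces the control dependency.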
#2

On Jan 11, 12:21 pm, "Rod Pemberton" wrote:

> The CPU's designers could execute the process in parallel with one core
> taking one direction for the branch and another core taking the other
> branch direction. Once the correct branch is determined, the bad
> execution path is discarded. [...]

Search on "speculative execution". It can also be done at the compiler level; that doesn't require processor support. Intel's P6 was their first chip to support it, IIRC.
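The compiler-level form Alex mentions is usually called if-conversion: evaluate both arms of a conditional unconditionally, then select, so the hardware sees straight-line code. A hedged sketch in Python (my own example, just illustrating the transformation a compiler would do):

```python
# The branchy original a compiler might see.
def clamp_branchy(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

# If-converted version: both comparisons are computed unconditionally and
# feed selects (multiplies) instead of jumps, as with x86 CMOV.
def clamp_if_converted(x, lo, hi):
    below = int(x < lo)
    above = int(x > hi)
    mid = x * (1 - below) * (1 - above)
    return mid + below * lo + above * hi
```

Both versions compute the same result; only the if-converted one is free of unpredictable branches.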
#3

In article , Alex McDonald wrote:

> Search on "speculative execution". It can also be done at the compiler
> level; that doesn't require processor support. Intel's P6 was their
> first chip to support it, IIRC.

But don't believe any page that says that it is a solution to the problem of memory latency (on cache misses). It isn't, and won't ever be, for any 'normal' language and application. That's a MUCH harder task :-(

Regards,
Nick Maclaren.
#4

Rod Pemberton wrote:

> The CPU's designers could execute the process in parallel with one core
> taking one direction for the branch and another core taking the other
> branch direction. Once the correct branch is determined, the bad
> execution path is discarded. [...]

That's "Eager Execution" (EE), something which good branch prediction makes even less of a win: you can never go through more than 2-3 branches like this before the power loss becomes totally horrendous (running 4-8 threads in parallel, 75-88% wasted resources), while lots of current code can branch-predict past 6-8 branches before the probability of a branch miss passes 50%, and all those branches could actually be inside the pipeline at once. EE should only be considered as a last resort for branches where the runtime branch history shows very low predictability, i.e. closer to 50% branch misses.

BTW, back in March 1995 I visited Intel in Santa Clara, in order to meet some people from the P6 team and to get an engineering-sample PentiumPro machine. I had to sign an NDA up front, so before doing that I insisted that I be allowed to write down on a whiteboard everything I then knew or had been able to guess about the architecture, based on nothing but public sources. I.e., this was to properly delineate what the NDA could cover. It turned out that I was correct in every particular, except for one thing: I expected EE because I didn't realize how much they had been able to improve the branch predictor. The engineers on the other side of the table were very happy, while some of the management and legal people seemed quite worried. :-)

Terje
--
- Terje.Mathisen at tmsw.no
"almost all programming can be viewed as an exercise in caching"
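The "4-8 threads, 75-88% wasted" figures follow directly from the geometry of forking: covering k unresolved branch levels requires 2^k live paths, only one of which is the real program. A quick check of the arithmetic in the post:

```python
# Full eager execution forks at every branch: after k unresolved branches
# there are 2**k live paths, and exactly one of them will retire.
def paths(k):
    return 2 ** k

# Everything off the surviving path is wasted work.
def wasted_fraction(k):
    return (paths(k) - 1) / paths(k)

# 2 branch levels -> 4 threads, 75% wasted; 3 levels -> 8 threads, 87.5%.
```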
#5

On Wed, 11 Jan 2012 07:21:53 -0500, Rod Pemberton wrote:

> The CPU's designers could execute the process in parallel with one core
> taking one direction for the branch and another core taking the other
> branch direction. Once the correct branch is determined, the bad
> execution path is discarded. [...]

I don't think this is how it's done on modern CPUs. They use branch prediction, which can be as good as 70%-80% right, if I recall correctly. When the wrong branch is taken, the entire pipeline needs to be flushed. That's the reason to avoid conditionals.

But let's say for argument's sake that you're right. There are a few problems with your scheme:

1) It's slow: because you're tying up another core, you effectively have one core less than previously.

2) The switching overhead may be large if it has to be done often.

3) What happens with code containing multiple branches? You can't follow them all. What to do?

4) Let's suppose you have lots of cores, so that (3) is not a big issue. Still, the number of instructions before the next branch could be less than your pipeline's depth, so you're wasting the advantage offered by a deep pipeline.

Thinking about Gavino's question again, it occurs to me that with predictive branching, a long loop will almost never flush the pipeline (because the prediction is done by keeping statistics of the historical branch choices). So this reason alone (pipeline flushing) is not a good reason to avoid using loops. Perhaps CM's comments were made when prediction was not as sophisticated as it is now.

Of course, avoiding conditionals entirely will give a faster solution. How much faster is the question. Does anyone have comments/experience with this?

Cheers,
Arnold
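The claim that a long loop "almost never" flushes the pipeline can be illustrated with a toy two-bit saturating-counter predictor (a textbook scheme, not any specific CPU's): a loop back-edge mispredicts roughly once per loop exit, while a pattern with no bias defeats this simple predictor on every other branch.

```python
def mispredictions(outcomes):
    """Count misses of a 2-bit saturating-counter predictor.

    Counter values 0-1 predict not-taken, 2-3 predict taken."""
    counter = 2          # start weakly predicting "taken"
    misses = 0
    for taken in outcomes:
        predicted = counter >= 2
        if predicted != taken:
            misses += 1
        # saturate the counter toward the actual outcome
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses

# A loop back-edge: taken 99 times, then not taken once at loop exit.
loop_branch = [True] * 99 + [False]

# An alternating pattern: unpredictable for this simple scheme, which
# misses every other branch (real predictors do better on this case).
alternating = [True, False] * 50
```

The loop branch costs one flush per 100 iterations; the hostile pattern costs one per two. That supports the point that loops per se are not the problem, unpredictable data-dependent branches are.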
#6

On 1/11/2012 4:29 AM, Alex McDonald wrote:

> Search on "speculative execution". It can also be done at the compiler
> level; that doesn't require processor support. Intel's P6 was their
> first chip to support it, IIRC.

Taking both paths on a branch is currently called "eager execution". Branch prediction is one form of "speculative execution". P6 did speculative, but not eager. As, for that matter, did P5 (Pentium).

I don't know of anyone doing full eager execution, although it has been studied out the bejeezus. Branch prediction usually beats it.

I believe an IBM chip did eager ifetch - fetching both sides of a branch - but did not actually execute, stalled at decoder or thereabouts.
#7

In article , Andy (Super) Glew wrote:

> I believe an IBM chip did eager ifetch - fetching both sides of a branch
> - but did not actually execute, stalled at decoder or thereabouts.

Doing some 'early' loads seems like a good idea, assuming that you have some IF units free. But that is a *lot* different from the other suggestion, copying the machine state to a different core.
#8

On 1/13/2012 9:47 AM, Joe Keane wrote:

> Doing some 'early' loads seems like a good idea, assuming that you have
> some IF units free. But that is a *lot* different from the other
> suggestion, copying the machine state to a different core.

Microarchitectures proposed for eager execution:

* Using the same pipeline(s) as ordinary non-eager execution:
** e.g., if stalled on the main path, execute on the alternative path.
** DEE: e.g., if you have gone too far down one path, execute on an alternative path that has a higher relative probability of success than predicting yet another branch on the current path. Actually, the DEE technique applies to all forms of eager execution.

* Using eager threads on the same CPU core, although not necessarily the same pipeline(s). Eager threading was one of the techniques I wanted to investigate on top of my multicluster multithreading (MCMT) substrate, although in my experience eager nearly always loses out to non-eager forms of speculation. Note that this is not necessarily the same pipeline resources, since in my terminology an MCMT machine like Bulldozer is really a single CPU with multiple execution clusters.

* Using eager threads of execution on different CPU cores. Problem is that eager threads seldom last long, so it is hard for eager threads to overcome the cost of migrating to another core.
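The DEE criterion, fork an alternative path only when it beats predicting yet another branch, can be framed with cumulative path probabilities. Assuming each branch is predicted correctly with probability p (an illustrative figure, not a measurement):

```python
# Probability of still being on the correct path after k predicted branches.
def on_path(p, k):
    return p ** k

# Depth at which the cumulative probability of being on the correct path
# first drops to a coin flip; past this point, an alternative path can be
# more likely right than yet another prediction on the current path.
def depth_until_coinflip(p):
    k = 0
    prob = 1.0
    while prob > 0.5:
        k += 1
        prob *= p
    return k
```

With p = 0.9 this gives 7, consistent with the earlier remark in the thread that current code can branch-predict past 6-8 branches before the miss probability passes 50%.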
#9

In article , Andy (Super) Glew wrote:

> Problem is that eager threads seldom last long, so it is hard for eager
> threads to overcome the cost of migrating to another core.

Essentially creating a new thread for every -branch- falls into the 'that don't sound right' category.
#10

Joe Keane wrote:

> Essentially creating a new thread for every -branch- falls into the
> 'that don't sound right' category.

I've thought about this in the context of "execute alternating instructions from both sides of the branch, adding implicit predicates to them", so that when the branch retires, the predicates become known and half the instructions are cancelled. The problem is of course that you need to fetch from two paths at the same time, i.e. you need a pair of virtualized IP registers as well.

The first criterion must be that you only ever do this for code that has been executed a number of times, and where the branch history has turned out to be very unpredictable. Another idea could be to start with the predicted branch, then switch to the other on the first stall/cache miss or new branch.

Terje
--
- Terje.Mathisen at tmsw.no
"almost all programming can be viewed as an exercise in caching"
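The interleave-with-implicit-predicates scheme can be mimicked in software: issue instructions from both arms tagged with a predicate, and at "retire" commit only the results whose predicate matches the branch outcome. A toy sketch (entirely my own framing of the idea in the post, not real microarchitecture):

```python
# Each "instruction" is (predicate, dest, value): it only commits if the
# branch outcome matches its predicate; the other half is cancelled.
def retire(instructions, branch_taken):
    registers = {}
    for predicate, dest, value in instructions:
        if predicate == branch_taken:
            registers[dest] = value
    return registers

# Interleaved issue from both arms of "if taken: x=1; y=2 else: x=9".
stream = [(True, "x", 1), (False, "x", 9), (True, "y", 2)]
```

Both arms occupy issue slots until the branch resolves, which is exactly the fetch-bandwidth and resource cost the post identifies.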