Branch Prediction Schemes

There are many methods to deal with the pipeline stalls caused by branch delay. We discuss four simple compile-time schemes in which predictions are static - they are fixed for each branch during the entire execution, and the predictions are compile-time guesses.
Stall pipeline
Predict taken
Predict not taken
Delayed branch

Stall pipeline

The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known.
Advantage: simple both to software and hardware (solution described earlier)
 

Predict Not Taken

A higher performance, and only slightly more complex, scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Care must be taken not to change the machine state until the branch outcome is definitely known.

The complexity arises from:
we have to know when the state might be changed by an instruction;
we have to know how to "back out" a change.

The pipeline with this scheme implemented behaves as shown below:
 

Untaken Branch Instr IF ID EX MEM WB    
Instr i+1   IF ID EX MEM WB   
Instr i+2     IF ID EX MEM WB
 
Taken Branch Instr IF ID EX MEM WB      
Instr i+1   IF idle idle idle idle    
Branch target     IF ID EX MEM WB  
Branch target+1       IF ID EX MEM WB
When branch is not taken, determined during ID, we have fetched the fall-through and just continue. If the branch is taken during ID, we restart the fetch at the branch target. This causes all instructions following the branch to stall one clock cycle.
 

Predict Taken

An alternative scheme is to predict the branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target address.

Because in DLX pipeline the target address is not known any earlier than the branch outcome, there is no advantage in this approach. In some machines where the target address is known before the branch outcome a predict-taken scheme might make sense.
 

Delayed Branch

In a delayed branch, the execution cycle with a branch delay of length n is

Branch instr
sequential successor 1
sequential successor 2
. . . . .
sequential successor n
Branch target if taken
Sequential successors are in the branch-delay slots. These instructions are executed whether or not the branch is taken.

The pipeline behavior of the DLX pipeline, which has one branch delay slot is shown below:
 
Untaken branch instr IF ID EX MEM WB        
Branch delay instr(i+1)   IF ID EX MEM WB      
Instr i+2     IF ID EX MEM WB    
Instr i+3       IF ID EX MEM WB  
Instr i+4         IF ID  EX MEM WB
 
Taken branch instr IF ID EX MEM WB        
Branch delay instr(i+1)   IF ID EX MEM WB      
Branch target     IF ID  EX MEM WB    
Branch target+1       IF ID EX MEM WB  
Branch target+2         IF ID EX MEM WB

The job of the compiler is to make the successor instructions valid and useful.
We will show three branch-scheduling schemes:

From before branch
From target
From fall through
Scheduling the branch-delay slot.

The left box in each pair shows the code before scheduling, the right box shows the scheduled code

In (a) the delay slot is scheduled with an independent instruction from before the branch. This is the best choice.

Strategies (b) and (c) are used when (a) is not possible.

In the code sequences for (b) and (c), the use of R1 in the branch condition prevents the ADD instruction (whose destination is R1) from being moved after the branch.

In (b) the branch-delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied because it can be reached by another path.

In (c) the branch-delay slot is scheduled from the not-taken fall through

To make this optimazation legal for (b) and (c), it must be OK to execute the SUB instruction when the branch goes in the unexpected direction. OK means that the work might be wasted but the program will still execute correctly.


 
Scheduling strategy Requirements When Improves Performance
From before branch Branch must not depend on the rescheduled instructions Always
From target Must be OK to execute rescheduled instructions if branch is not taken When branch is taken. May enlarge program if instructions are duplicated
From fall though Must be OK to execute instructions if branch is taken When branch is not taken

The limitations on delayed-branch scheduling arise from
the restrictions on the instructions that are scheduled into the delay slots and
our ability to predict at compile time whether a branch is likely to be taken or not.
 

Cancelling Branch

To improve the ability of the compiler to fill branch delay slots, most machines with conditional branches have introduced a cancelling branch. In a cancelling branch the instruction includes the direction that the branch was predicted.
- if the branch behaves as predicted, the instruction in the branch delay slot is fully executed;
- if the branch is incorrectly predicted, the instruction in the delay slot is turned into no-op(idle).
 
The behavior of a predicted-taken cancelling branch depends on whether the branch is taken or not:
Untaken branch instr IF ID EX MEM WB        
Branch delay instr(i+1)   IF ID idle idle idle      
Instr i+2     IF ID EX MEM WB    
Instr i+3       IF ID EX MEM WB  
Instr i+4         IF ID  EX MEM WB
 
Taken branch instr IF ID EX MEM WB        
Branch delay instr(i+1)   IF ID EX MEM WB      
Branch target     IF ID  EX MEM WB    
Branch target+1       IF ID EX MEM WB  
Branch target+2         IF ID EX MEM WB

The advantage of cancelling branches is that they eliminate the requirements on the instruction placed in the delay slot.

Delayed branches are an architecturally visible feature of the pipeline. This is the source both of their advantage - allowing the use of simple compiler scheduling to reduce branch penalties; and
their disadvantage - exposing an aspect of the implementation that is likely to change.