## **Modern Front-end Support in gem5**

Bhargav Reddy Godala, Nayana Prasad Nagendra, Ishita Chaturvedi, Simone Campanoni, David I. August



PRINCETON UNIVERSITY



Liberty Research Group



Arcana Research Group

#### Introduction

- We have seen that aggressive out-of-order CPUs tolerate data miss latency.
- Modern CPUs employ a decoupled front-end to tolerate instruction miss latency.
- What is a decoupled front-end?

#### **State-of-the-Art Front-end**



- **FTQ:** Fetch Target Queue
- **IFU:** Instruction Fetch Unit
- **BPU:** Branch Prediction Unit
- **IAG:** Instruction Address Generation
- **NIP:** Next Instruction Pointer

#### **Traditional Front-end**

EMISSARY, Nagendra and Godala, et al.

#### Key Idea: Prefetch in the predicted path

Fetch Directed Instruction Prefetching (FDIP) pipeline [Glenn Reinman et al., MICRO'99]

# Design

## Challenges in Implementing FDIP in gem5

- The fetch stage is already complex.
- Dynamic instruction objects are constructed before the BPU is invoked.
  - The branch instruction is needed to invoke the BPU.
- Sequence numbers are used to squash mis-speculated instructions.

## **Branch Sequence Numbers**

- A unique branch sequence number identifies each branch.
- Every dynamic instruction contains:
  - A sequence number
  - The branch sequence number of the most recent branch

| BrSeq | Seq | Instruction |
|-------|-----|-------------|
| 10    | 100 | Br          |
| 10    | 101 | I0          |
| 10    | 102 | I1          |
| 10    | 103 | I2          |
| 11    | 104 | Br          |
| 11    | 105 | I3          |
| 11    | 106 | I4          |
| 11    | 107 | I5          |
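The tagging scheme in the table can be sketched in Python. This is a minimal illustration under stated assumptions, not the gem5 implementation: `DynInst`, `tag_stream`, and `squash_younger` are hypothetical names, and the squash rule (keep the mispredicted branch, drop everything fetched after it) is an assumed policy used only to show why each instruction carries the branch sequence number.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class DynInst:
    """Hypothetical stand-in for a dynamic instruction (not gem5's class)."""
    is_branch: bool
    seq: int = 0     # unique per-instruction sequence number
    br_seq: int = 0  # sequence number of the latest branch (itself, if a branch)


def tag_stream(insts: Iterable[DynInst], first_seq: int, prev_br_seq: int) -> List[DynInst]:
    """Assign Seq and BrSeq in fetch order, following the table above."""
    tagged = []
    seq, br_seq = first_seq, prev_br_seq
    for inst in insts:
        if inst.is_branch:
            br_seq += 1  # each branch gets the next branch sequence number
        inst.seq, inst.br_seq = seq, br_seq
        tagged.append(inst)
        seq += 1
    return tagged


def squash_younger(insts: List[DynInst], bad_br_seq: int) -> List[DynInst]:
    """Assumed squash rule: keep the mispredicted branch, drop all younger work."""
    return [
        inst for inst in insts
        if inst.br_seq < bad_br_seq
        or (inst.br_seq == bad_br_seq and inst.is_branch)
    ]


# A branch, three instructions, a branch, three instructions (as in the table).
stream = [DynInst(is_branch=(i % 4 == 0)) for i in range(8)]
tagged = tag_stream(stream, first_seq=100, prev_br_seq=9)
survivors = squash_younger(tagged, bad_br_seq=11)
```

Here `tagged` reproduces the BrSeq and Seq columns of the table, and `survivors` holds the instructions with Seq 100 through 104 if the branch with BrSeq 11 turns out to be mispredicted.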

### Fetch Target Queue (FTQ)

- Each entry consists of:
  - A begin address (target of prior branch)
  - End address (branch PC)
  - Target address
  - Branch Sequence number
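The entry layout above maps naturally onto a small record type. The following is a sketch only: the class names, field names, and queue methods are hypothetical and do not mirror the gem5 source, and the capacity of 24 is taken from the evaluation configuration later in this deck.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class FTQEntry:
    begin_addr: int   # target of the prior branch (start of this fetch block)
    end_addr: int     # PC of the branch that ends the block
    target_addr: int  # predicted target of that branch
    br_seq: int       # branch sequence number of that branch


class FetchTargetQueue:
    """Bounded FIFO of FTQ entries; method names are illustrative."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: Deque[FTQEntry] = deque()

    def __len__(self) -> int:
        return len(self._entries)

    def push(self, entry: FTQEntry) -> bool:
        """Append a predicted block; return False if the queue is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append(entry)
        return True

    def head(self) -> Optional[FTQEntry]:
        return self._entries[0] if self._entries else None

    def pop(self) -> FTQEntry:
        return self._entries.popleft()

    def flush(self) -> None:
        self._entries.clear()


ftq = FetchTargetQueue(capacity=24)
ftq.push(FTQEntry(begin_addr=0x1000, end_addr=0x101C, target_addr=0x2000, br_seq=10))
# The next block begins at the previous entry's target address.
ftq.push(FTQEntry(begin_addr=0x2000, end_addr=0x2008, target_addr=0x1000, br_seq=11))
```

Storing the branch sequence number in each entry lets a later squash identify which queued blocks belong to the mis-speculated path.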







## **Prefetch Engine**

- Prefetch Buffer:
  - Address to prefetch
  - Issue one prefetch and insert into Fetch Buffer
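One way to picture the prefetch engine is as a queue of cache-line addresses derived from FTQ entries, drained one address at a time. The sketch below is an assumption-laden illustration rather than the gem5 code: the 64-byte line size, the `PrefetchEngine` class, its method names, and the duplicate filtering are all hypothetical choices.

```python
from collections import deque
from typing import List, Optional

LINE_BYTES = 64  # assumed cache-line size


def lines_covered(begin_addr: int, end_addr: int, line_bytes: int = LINE_BYTES) -> List[int]:
    """Return the aligned line addresses touched by [begin_addr, end_addr]."""
    first = begin_addr - begin_addr % line_bytes
    last = end_addr - end_addr % line_bytes
    return list(range(first, last + 1, line_bytes))


class PrefetchEngine:
    def __init__(self) -> None:
        self.prefetch_buffer = deque()  # addresses waiting to be prefetched
        self.fetch_buffer = []          # lines whose prefetch has been issued

    def enqueue_block(self, begin_addr: int, end_addr: int) -> None:
        """Queue every line of a predicted block that is not already tracked."""
        for line in lines_covered(begin_addr, end_addr):
            if line not in self.prefetch_buffer and line not in self.fetch_buffer:
                self.prefetch_buffer.append(line)

    def tick(self) -> Optional[int]:
        """Issue at most one prefetch per call and record it in the fetch buffer."""
        if not self.prefetch_buffer:
            return None
        line = self.prefetch_buffer.popleft()
        self.fetch_buffer.append(line)
        return line


engine = PrefetchEngine()
engine.enqueue_block(0x1038, 0x1084)        # one FTQ entry's begin and end address
issued = [engine.tick() for _ in range(4)]  # the fourth call finds the buffer empty
```

The block from `0x1038` to `0x1084` spans three 64-byte lines, so three calls to `tick` issue prefetches and the fourth returns `None`.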





#### **Modified Fetch Stage**

```
if instruction is control then:
    if address == FTQ.head.branchPC:
        PC = FTQ.head.target
        FTQ.pop()
    else:
        FTQ.flush()
else:
    PC += instruction.size()
```
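The fetch-stage logic can also be written as a small runnable Python function. This is a sketch under stated assumptions, not the gem5 fetch stage: `FTQEntry`, `Inst`, and `next_fetch_pc` are hypothetical names, and the slide does not say what the PC becomes after an FTQ flush, so the sequential fall-through used here is an assumption.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque


@dataclass
class FTQEntry:
    branch_pc: int  # PC of the branch that ends the predicted block
    target: int     # predicted target of that branch


@dataclass
class Inst:
    pc: int
    size: int
    is_control: bool


def next_fetch_pc(inst: Inst, ftq: Deque[FTQEntry]) -> int:
    """Return the next fetch PC, consuming or flushing the FTQ as needed."""
    if inst.is_control:
        if ftq and inst.pc == ftq[0].branch_pc:
            # The FTQ head predicted this branch: follow it and pop the entry.
            return ftq.popleft().target
        # The FTQ disagrees with the fetched stream: discard its contents.
        ftq.clear()
        # Assumption: fall through sequentially until the front-end is redirected.
        return inst.pc + inst.size
    return inst.pc + inst.size


ftq = deque([FTQEntry(branch_pc=0x1008, target=0x2000)])
pc_after_add = next_fetch_pc(Inst(pc=0x1004, size=4, is_control=False), ftq)
pc_after_branch = next_fetch_pc(Inst(pc=0x1008, size=4, is_control=True), ftq)
```

A non-control instruction only advances the PC by its size, while a control instruction either consumes the matching FTQ head or invalidates the queue.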

# Optimizations

#### **Basic Block Based BTB**



| Index | Target  |
|-------|---------|
| br1   | target1 |
| br2   | target2 |
| br3   | target3 |

PC-based BTB

| Index   | Target  | Branch |
|---------|---------|--------|
| target1 | target2 | br2    |
| target2 | target3 | br3    |

BBL-based BTB
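The difference between the two organizations is easiest to see as two lookup tables. The sketch below uses plain dictionaries and made-up addresses for `br1` to `br3` and `target1` to `target3`; the `walk_blocks` helper is hypothetical and, as a simplification, treats every branch it finds as predicted taken.

```python
from typing import Dict, List, Tuple

# Hypothetical addresses standing in for br1..br3 and target1..target3.
BR1, BR2, BR3 = 0x0FFC, 0x1010, 0x2008
TARGET1, TARGET2, TARGET3 = 0x1000, 0x2000, 0x3000

# PC-based BTB: branch PC -> target.
pc_btb: Dict[int, int] = {BR1: TARGET1, BR2: TARGET2, BR3: TARGET3}

# BBL-based BTB: block start -> (target, PC of the branch that ends the block).
bbl_btb: Dict[int, Tuple[int, int]] = {
    TARGET1: (TARGET2, BR2),
    TARGET2: (TARGET3, BR3),
}


def walk_blocks(
    btb: Dict[int, Tuple[int, int]], start: int, max_blocks: int
) -> List[Tuple[int, int, int]]:
    """Chain BBL-BTB lookups, returning (begin, branch PC, target) per block.

    Simplification: every branch found is treated as predicted taken.
    """
    blocks = []
    addr = start
    for _ in range(max_blocks):
        entry = btb.get(addr)
        if entry is None:
            break  # BTB miss: no known block starts at this address
        target, branch_pc = entry
        blocks.append((addr, branch_pc, target))
        addr = target
    return blocks


blocks = walk_blocks(bbl_btb, TARGET1, max_blocks=4)
```

Each BBL-based lookup returns both the branch that ends the block and the start of the next block, which are the begin, end, and target fields an FTQ entry needs.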

## **Pre-decode And Early Correction**

- The BBL BTB is indexed by the start address of a basic block.
- The start of a basic block is identified:
  - As the instruction that immediately follows a branch instruction.
- Early correction:
  - Triggered when an unconditional branch is predicted not taken.
  - Flush the FTQ and restart using the pre-decoded target.
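The early-correction rule can be sketched as a single check performed on each pre-decoded instruction. The names below (`PreDecoded`, `early_correct`) are hypothetical, and the requirement that the target be known at pre-decode (a direct branch) is an assumption added for the sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class PreDecoded:
    """Hypothetical pre-decode result for one fetched instruction."""
    pc: int
    is_unconditional_branch: bool
    target: Optional[int] = None  # known only for direct branches


def early_correct(inst: PreDecoded, predicted_taken: bool, ftq: Deque) -> Optional[int]:
    """Return a restart PC if early correction fires, else None."""
    if inst.is_unconditional_branch and not predicted_taken and inst.target is not None:
        ftq.clear()         # discard the wrong-path fetch targets
        return inst.target  # restart address generation at the pre-decoded target
    return None


wrong_path_ftq = deque(["stale entry 1", "stale entry 2"])
jump = PreDecoded(pc=0x1010, is_unconditional_branch=True, target=0x2000)
restart_pc = early_correct(jump, predicted_taken=False, ftq=wrong_path_ftq)
```

Because an unconditional branch is always taken, a not-taken prediction for it is known to be wrong as soon as the instruction is pre-decoded, without waiting for execution.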

## **Branch Predictor Changes**

- BBL Based Branch Predictor lookup.
- Branch Sequence numbers.
- ITTAGE indirect predictor support.

#### X86 vs ARM

- X86:
  - Variable width instructions
  - Pre-decoding is very expensive
  - Micro Sequenced Ops
  - Exception handling using ROM

- ARM:
  - Fixed width instructions
  - Pre-decoding is not expensive

## **Micro Branches in X86**

- In X86, some instructions are dynamically decoded into loops.
  - Example: string copy
- These micro-branches are not inserted into the BTB.
- They are handled as a special case:
  - They are not seen by the FDIP pipeline.
  - At fetch time, the back edge is predicted taken.
  - The FTQ is not flushed until a squash is received from later stages.



#### Performance Bug Fixes

- Perfect recovery of branch history.
- TAGE Bimodal table roll back.



# Evaluation

#### Performance of ARM workloads with FDIP

| Field    | Alder Lake-like               |
|----------|-------------------------------|
| ISA      | ARM 64-bit                    |
| L1I      | 32KB                          |
| L1D      | 64KB                          |
| L2       | 1MB (16-way)                  |
| L3       | 2MB                           |
| FTQ      | 24 entries, 192 instructions  |
| Width    | 8-wide                        |
| ROB Size | 512 entries                   |
| IQ/LQ/SQ | 240/128/72                    |
| BPU      | TAGE, ITTAGE                  |
| BTB      | 16K entries                   |

gem5 O3 CPU simulation parameters

[Figure: bar chart; y-axis: % IPC improvement, 0% to 90%]

IPC performance improvement of ARM workloads in % over the no-FDIP baseline



#### Performance of X86 workloads with FDIP

| Field    | Alder Lake-like               |
|----------|-------------------------------|
| ISA      | X86 64-bit                    |
| L1I      | 32KB                          |
| L1D      | 64KB                          |
| L2       | 1MB (16-way)                  |
| L3       | 2MB                           |
| FTQ      | 24 entries, 192 instructions  |
| Width    | 8-wide                        |
| ROB Size | 512 entries                   |
| IQ/LQ/SQ | 240/128/72                    |
| BPU      | TAGE, ITTAGE                  |
| BTB      | 16K entries                   |

gem5 O3 CPU simulation parameters

[Figure: bar chart; y-axis: % IPC improvement]

IPC performance improvement of X86 workloads in % over the no-FDIP baseline

#### Performance of X86 SPEC17 workloads with FDIP

| Field    | Alder Lake-like               |
|----------|-------------------------------|
| ISA      | X86 64-bit                    |
| L1I      | 32KB                          |
| L1D      | 64KB                          |
| L2       | 1MB (16-way)                  |
| L3       | 2MB                           |
| FTQ      | 24 entries, 192 instructions  |
| Width    | 8-wide                        |
| ROB Size | 512 entries                   |
| IQ/LQ/SQ | 240/128/72                    |
| BPU      | TAGE, ITTAGE                  |
| BTB      | 16K entries                   |

gem5 O3 CPU simulation parameters

IPC performance improvement of X86 SPEC17 workloads in % over the no-FDIP baseline



## **Published Works**

#### EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching

- ISCA'23
- Session 2B

## Conclusion

- We implemented FDIP in gem5.
- FDIP delivers a significant IPC improvement over the no-FDIP baseline.
- This work was used in EMISSARY [ISCA'23].
- Available at https://github.com/PrincetonUniversity/gem5_FDIP
- Workloads: https://tinyurl.com/yjsc2aw4





# Thank you

Questions?