# Dual-Issue Snitch

This proposal makes Snitch and its FPU *pseudo-dual-issue* in a very lightweight manner. It is not truly dual-issue, since Snitch still issues only one instruction per cycle, but in-FPU instruction repetition (enabled by the regularity of SSRs) allows the integer and float pipelines to run in parallel.

## `frep` Instruction

The reduction in the matrix-vector and matrix-matrix products looks as follows (using SSRs, assuming 8 elements, inner dimension unrolled, no hardware loops, to emphasize the effect):

```asm
    la a0, input_A
    la a1, input_B
    la a3, SSR_CFG
    li a4, 16                   # number of outer iterations
loop_start:
    sw a0, 0(a3)                # configure SSR0, set data pointer
    sw a1, 4(a3)                # configure SSR1, set data pointer
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    addi a4, a4, -1             # adjust loop counter
    add a0, a0, s0              # adjust address for outer dimension
    add a1, a1, s0              # adjust address for outer dimension
    bgtz a4, loop_start
```

That's 8 FPU ops and 6 integer ops per iteration: a peak FPU utilization of 57%, even with SSRs (it would be far worse without them -- 36%). However, the FPU ops are very regular thanks to the SSRs.
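The utilization arithmetic can be checked directly (a Python sketch; the one-explicit-load-per-`fmadd` figure for the no-SSR case is an assumption chosen to match the 36% quoted above):

```python
# FPU utilization = FPU issue slots / total issue slots per iteration.
fpu_ops = 8   # fmadd instructions in the loop body
int_ops = 6   # 2x sw, addi, 2x add, bgtz

with_ssr = fpu_ops / (fpu_ops + int_ops)
# Without SSRs, assume one explicit load per fmadd feeds the FPU.
without_ssr = fpu_ops / (fpu_ops + int_ops + fpu_ops)

print(f"{with_ssr:.0%}")     # 57%
print(f"{without_ssr:.0%}")  # 36%
```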
The FPU has a register to hold the current op anyway; let's add a way to sequence that issue into the pipeline multiple times:

```asm
    la a0, input_A
    la a1, input_B
    la a3, SSR_CFG
    li a4, 16
loop_start:
    sw a0, 0(a3)
    sw a1, 4(a3)
    frep 8, 1                   # repeat next FPU inst 8x
    fmadd fa0, ft0, ft1, fa0
    addi a4, a4, -1
    add a0, a0, s0
    add a1, a1, s0
    bgtz a4, loop_start
```

This now executes as follows (program and integer pipeline on the left, FPU pipeline on the right):

```
Core                     | FPU
------------------------ | ------------------------
la a0, input_A           |
la a1, input_B           |
la a3, SSR_CFG           |
li a4, 16                |
sw a0, 0(a3)             |
sw a1, 4(a3)             |
frep 8, 1                | frep 8, 1
fmadd fa0, ft0, ft1, fa0 | fmadd fa0, ft0, ft1, fa0   # L0
addi a4, a4, -1          | fmadd fa0, ft0, ft1, fa0
add a0, a0, s0           | fmadd fa0, ft0, ft1, fa0
add a1, a1, s0           | fmadd fa0, ft0, ft1, fa0
bgtz a4, loop_start      | fmadd fa0, ft0, ft1, fa0
sw a0, 0(a3)             | fmadd fa0, ft0, ft1, fa0
sw a1, 4(a3)             | fmadd fa0, ft0, ft1, fa0
frep 8, 1                | fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0 | fmadd fa0, ft0, ft1, fa0   # L1
addi a4, a4, -1          | fmadd fa0, ft0, ft1, fa0
add a0, a0, s0           | fmadd fa0, ft0, ft1, fa0
add a1, a1, s0           | fmadd fa0, ft0, ft1, fa0
[...]                    | [...]
```

Notice how the integer pipeline continues with the program while the FPU operates; FPU utilization is now at 100%. In fact, we believe we can reach 100% for many kernels, as long as the kernel has more float instructions than integer instructions.

### Register Staggering

Simply repeating the same instruction causes lots of data hazards in the FPU pipeline.
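The cost of such hazards can be estimated with a minimal in-order issue model (a Python sketch; the 2-cycle `fmadd` latency and the one-issue-per-cycle limit are assumptions of the model):

```python
# Sketch: issue cycles for a stream of FPU ops under a simple in-order
# model with a fixed result latency (2 cycles assumed, as for fmadd).
def issue_cycles(ops, latency=2):
    """ops: list of (dest, sources); returns the cycle each op issues."""
    ready = {}      # register -> cycle its value becomes available
    cycle = 0
    schedule = []
    for dest, srcs in ops:
        # stall until all source registers are ready (RAW hazard)
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        schedule.append(cycle)
        ready[dest] = cycle + latency
        cycle += 1  # at most one issue per cycle
    return schedule

# Repeating the same fmadd: each op reads the fa0 it also writes.
same = [("fa0", ("ft0", "ft1", "fa0"))] * 4
# Separate accumulators fa0..fa3: no back-to-back dependency.
spread = [(f"fa{i}", ("ft0", "ft1", f"fa{i}")) for i in range(4)]

print(issue_cycles(same))    # [0, 2, 4, 6] -- stalls every iteration
print(issue_cycles(spread))  # [0, 1, 2, 3] -- back-to-back issue
```

Spreading the accumulation across `fa0..fa3` removes the dependency chain, which is exactly what register staggering automates.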
For example, an accumulation:

```asm
frep 4, 1                       # repeat next inst 4x
fmadd fa0, ft0, ft1, fa0
```

Assuming two cycles of latency for `fmadd`, this pipelines as follows:

```asm
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
```

Each instruction reads the `fa0` written by the previous one, so the pipeline stalls for the full `fmadd` latency between iterations. Register staggering increments the register arguments with each iteration:

```asm
frep 4, 1, 0b1001               # repeat next inst 4x, inc rd and rs3
fmadd fa0, ft0, ft1, fa0
```

This unrolls to:

```asm
fmadd fa0, ft0, ft1, fa0
fmadd fa1, ft0, ft1, fa1
fmadd fa2, ft0, ft1, fa2
fmadd fa3, ft0, ft1, fa3
```

## Loop Buffer

Only the simplest kernels consist of a single FMA instruction in the innermost loop. The innermost iteration of the FFT kernel, for example, looks as follows:

```asm
fmul  ft2, ft4, ft0
fmul  ft3, ft5, ft0
fmsub ft2, ft5, ft0, ft2
fmadd ft3, ft4, ft0, ft3
fadd  ft1, ft0, ft2
fsub  ft1, ft0, ft2
fadd  ft1, ft0, ft3
fsub  ft1, ft0, ft3
```

We add an instruction buffer to the FPU, implemented as a latch-based ring buffer. A small counter in the FPU can then be used to repeat previous instructions. This enables **microloops** to run entirely in the FPU:

```asm
frep 256, 8                     # repeat the next 8 inst 256x
fmul  ft2, ft4, ft0
fmul  ft3, ft5, ft0
fmsub ft2, ft5, ft0, ft2
fmadd ft3, ft4, ft0, ft3
fadd  ft1, ft0, ft2
fsub  ft1, ft0, ft2
fadd  ft1, ft0, ft3
fsub  ft1, ft0, ft3
```

This loop then runs in the FPU pipeline for the next 2048 cycles (256 iterations of 8 instructions), during which the integer pipeline is free to compute the next FFT address pattern. That is a lot of time, and it can be put to good use implementing the bit-reversal needed for the more interesting, memory-optimal in-place FFT.

## Hardware Cost

We have implemented this scheme as a block that sits between Snitch and the FPU. Sizes for different microloop depths:

- **8** instructions: 3.5 kGE
- **16** instructions: 5.6 kGE

## Conclusion

This gives Snitch the same capability as NTX, with improved flexibility and programmability.
## `frep` Instruction Encoding

Arguments that need to be encoded into `frep`:

- `is_outer`: 1 bit
- `max_inst`: immediate (up to 16 values)
- `max_rep`: register
- `stagger_mask`: 4 bit
- `stagger_count`: 3 bit

Mapping:

- `6..0`: opcode `custom`
- `7` (rd[0]): `is_outer`
- `11..8` (rd[4..1]): `stagger_mask`
- `14..12` (rm): `stagger_count`
- `19..15` (rs1): `max_rep`
- `31..20` (imm12): `max_inst`
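The mapping above can be exercised with a small packer (a Python sketch; the concrete `custom` opcode value `0b0001011`, i.e. the RISC-V custom-0 space, and the example register assignment are assumptions):

```python
# Sketch: pack the frep fields into a 32-bit instruction word,
# following the field positions listed in the mapping above.
def encode_frep(max_inst, max_rep_rs1, stagger_count, stagger_mask,
                is_outer, opcode=0b0001011):  # custom-0 opcode assumed
    assert 0 <= max_inst < (1 << 12)   # imm12 field
    assert 0 <= max_rep_rs1 < 32       # rs1 register index
    assert 0 <= stagger_count < 8      # 3-bit rm field
    assert 0 <= stagger_mask < 16      # 4-bit mask in rd[4..1]
    word = opcode                      # bits 6..0
    word |= (is_outer & 1) << 7        # rd[0]
    word |= stagger_mask << 8          # rd[4..1] -> bits 11..8
    word |= stagger_count << 12        # rm       -> bits 14..12
    word |= max_rep_rs1 << 15          # rs1      -> bits 19..15
    word |= max_inst << 20             # imm12    -> bits 31..20
    return word

# "frep 8, 1" with the repeat count held in register a4 (x14):
print(hex(encode_frep(max_inst=1, max_rep_rs1=14, stagger_count=0,
                      stagger_mask=0, is_outer=0)))  # 0x17000b
```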