Dual-Issue Snitch

This proposal makes Snitch and its FPU pseudo-dual-issue in a very lightweight manner. It is not truly dual-issue, since Snitch still issues only one instruction per cycle, but in-FPU instruction repetition (enabled by the regularity of SSRs) allows the integer and float pipelines to run in parallel.

frep Instruction

The inner reduction of a matrix-vector or matrix-matrix product looks as follows (using SSRs; assuming 8 elements, inner dimension fully unrolled, and no hardware loops, to emphasize the effect):

  la a0, input_A
  la a1, input_B
  la a3, SSR_CFG
  li a4, 16  # number of outer iterations
loop_start:
  sw a0, 0(a3)  # configure SSR0, set data pointer
  sw a1, 4(a3)  # configure SSR1, set data pointer
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  fmadd fa0, ft0, ft1, fa0
  addi a4, a4, -1  # adjust loop counter
  add a0, a0, s0  # adjust address for outer dimension
  add a1, a1, s0  # adjust address for outer dimension
  bgtz a4, loop_start

That’s 8 FPU ops and 6 integer ops per iteration, for a peak FPU utilization of 57%, even with SSRs (without them it would be a dismal 36%). However, the FPU ops are very regular thanks to the SSRs. The FPU already holds the current op in a register anyway; let’s add a way to sequence that op into the pipeline multiple times:

  la a0, input_A
  la a1, input_B
  la a3, SSR_CFG
  li a4, 16
loop_start:
  sw a0, 0(a3)
  sw a1, 4(a3)
  frep 8, 1  # repeat next FPU inst 8x
  fmadd fa0, ft0, ft1, fa0
  addi a4, a4, -1
  add a0, a0, s0
  add a1, a1, s0
  bgtz a4, loop_start

This now executes as follows (program and integer pipeline on the left, FPU pipeline on the right):

Core                      |  FPU
------------------------  |  ------------------------
la a0, input_A            |  <idle>
la a1, input_B            |  <idle>
la a3, SSR_CFG            |  <idle>
li a4, 16                 |  <idle>
sw a0, 0(a3)              |  <idle>
sw a1, 4(a3)              |  <idle>
frep 8, 1                 |  frep 8, 1
fmadd fa0, ft0, ft1, fa0  |  fmadd fa0, ft0, ft1, fa0  # L0
addi a4, a4, -1           |  fmadd fa0, ft0, ft1, fa0
add a0, a0, s0            |  fmadd fa0, ft0, ft1, fa0
add a1, a1, s0            |  fmadd fa0, ft0, ft1, fa0
bgtz a4, loop_start       |  fmadd fa0, ft0, ft1, fa0
sw a0, 0(a3)              |  fmadd fa0, ft0, ft1, fa0
sw a1, 4(a3)              |  fmadd fa0, ft0, ft1, fa0
frep 8, 1                 |  fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0  |  fmadd fa0, ft0, ft1, fa0  # L1
addi a4, a4, -1           |  fmadd fa0, ft0, ft1, fa0
add a0, a0, s0            |  fmadd fa0, ft0, ft1, fa0
add a1, a1, s0            |  fmadd fa0, ft0, ft1, fa0
[...]                     |  [...]

Notice how the integer pipeline continues with the program while the FPU operates; FPU utilization is now at 100%. In fact, we believe 100% is achievable for many kernels, as long as the loop body contains more float instructions than integer instructions.
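The utilization figures quoted above can be sanity-checked with a short model. Two assumptions are mine, not the proposal's: without SSRs, each fmadd needs one explicit load for the streamed operand (with the same 6-op loop overhead), and with frep the two pipelines overlap, so the longer instruction stream dominates.

```python
# Back-of-the-envelope check of the FPU utilization figures.
# Assumption: without SSRs, each fmadd costs one extra explicit load.

def fpu_util(fpu_ops: int, int_ops: int, dual_issue: bool) -> float:
    """Steady-state FPU utilization per loop iteration."""
    if dual_issue:
        # With frep, int and FPU ops overlap; the longer stream dominates.
        return fpu_ops / max(fpu_ops, int_ops)
    # Single-issue: every op occupies the one issue slot.
    return fpu_ops / (fpu_ops + int_ops)

print(f"{fpu_util(8, 6, False):.0%}")      # SSRs, no frep  -> 57%
print(f"{fpu_util(8, 6 + 8, False):.0%}")  # no SSRs        -> 36%
print(f"{fpu_util(8, 6, True):.0%}")       # SSRs and frep  -> 100%
```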

Register Staggering

Simply repeating the same instruction causes many data hazards in the FPU pipeline. Consider an accumulation:

frep 4, 1  # repeat next inst 4x
fmadd fa0, ft0, ft1, fa0

Assuming a three-cycle fmadd latency (two bubble cycles between dependent issues), this pipelines as follows:

fmadd fa0, ft0, ft1, fa0
<bubble>
<bubble>
fmadd fa0, ft0, ft1, fa0
<bubble>
<bubble>
fmadd fa0, ft0, ft1, fa0
<bubble>
<bubble>
fmadd fa0, ft0, ft1, fa0

Register staggering increments the register arguments with each iteration:

frep 4, 1, 0b1001  # repeat next inst 4x, inc rd and rs3
fmadd fa0, ft0, ft1, fa0

Unrolls to:

fmadd fa0, ft0, ft1, fa0
fmadd fa1, ft0, ft1, fa1
fmadd fa2, ft0, ft1, fa2
fmadd fa3, ft0, ft1, fa3
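The staggering rule can be sketched as a small expansion model. The mask bit assignment (bit 0 = rd, bit 1 = rs1, bit 2 = rs2, bit 3 = rs3) and the wrap at stagger_count + 1 are my assumptions for illustration; the encoding below only fixes the field widths, not these semantics.

```python
# Hypothetical model of register staggering: on iteration i, the registers
# selected by stagger_mask are offset by i, wrapping at stagger_count + 1.
# Mask bit assignment (bit0=rd, bit1=rs1, bit2=rs2, bit3=rs3) is assumed.

def stagger(inst, reps, mask, count=7):
    """Unroll a staggered frep body. `inst` is (op, rd, rs1, rs2, rs3)."""
    op, *regs = inst
    out = []
    for i in range(reps):
        off = i % (count + 1)
        out.append((op, *[r + off if mask & (1 << b) else r
                          for b, r in enumerate(regs)]))
    return out

# frep 4, 1, 0b1001  -- repeat 4x, increment rd and rs3
body = ("fmadd", 10, 0, 1, 10)  # fa0, ft0, ft1, fa0 (fa0 is f10)
for op, rd, rs1, rs2, rs3 in stagger(body, 4, 0b1001):
    print(op, rd, rs1, rs2, rs3)   # rd and rs3 walk fa0..fa3
```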

Loop Buffer

Only the simplest kernels consist of a single FMA instruction in the innermost loop. The innermost loop body of an FFT kernel, for example, looks as follows:

fmul   fa2, fa4, fa0
fmul   fa3, fa5, fa0
fmsub  fa2, fa5, fa0, fa2
fmadd  fa3, fa4, fa0, fa3
fadd   fa1, fa0, fa2
fsub   fa1, fa0, fa2
fadd   fa1, fa0, fa3
fsub   fa1, fa0, fa3

We add an instruction buffer to the FPU, implemented as a latch-based ring buffer. A small counter in the FPU can then be used to repeat previous instructions. This enables microloops to run entirely in the FPU:

frep   256, 8  # repeat the next 8 inst 256x
fmul   fa2, fa4, fa0
fmul   fa3, fa5, fa0
fmsub  fa2, fa5, fa0, fa2
fmadd  fa3, fa4, fa0, fa3
fadd   fa1, fa0, fa2
fsub   fa1, fa0, fa2
fadd   fa1, fa0, fa3
fsub   fa1, fa0, fa3

This loop then occupies the FPU pipeline for the next 2048 cycles (256 × 8), during which the integer pipeline is free to compute the next FFT address pattern. This is plenty of time, and it can be put to good use implementing the bit-reversed addressing needed for the more interesting in-place (memory-optimal) FFT.
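The sequencer behavior can be modeled in a few lines. Names and structure here are illustrative, not the RTL; register staggering is omitted, and the first pass is simplified to fill the whole buffer up front rather than streaming from the core.

```python
# Toy model of the FPU-side sequencer: a ring buffer holds the microloop
# body, and a repetition counter replays it into the FPU issue stream.
from collections import deque

class Sequencer:
    def __init__(self, depth=8):
        self.buffer = deque(maxlen=depth)  # latch-based ring buffer in HW

    def frep(self, max_rep, max_inst, program):
        self.buffer.clear()
        for inst in program[:max_inst]:    # fill the buffer once
            self.buffer.append(inst)
        # Replay entirely from the buffer; the core is free meanwhile.
        return [inst for _ in range(max_rep) for inst in self.buffer]

fft_body = ["fmul", "fmul", "fmsub", "fmadd", "fadd", "fsub", "fadd", "fsub"]
issued = Sequencer().frep(256, 8, fft_body)
print(len(issued))  # 2048 FPU issue slots: 256 x 8
```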

Hardware Cost

We have implemented this scheme as a block that sits between Snitch and the FPU. Sizes for different microloop depths:

  • 8 instructions: 3.5 kGE
  • 16 instructions: 5.6 kGE

Conclusion

This gives Snitch the same capability as NTX, while offering improved flexibility and programmability.

frep Instruction Encoding

Arguments that need to be encoded into frep:

  • is_outer: 1 bit
  • max_inst: immediate (up to 16 values)
  • max_rep: register
  • stagger_mask: 4 bit
  • stagger_count: 3 bit

Mapping:

  • 6..0: custom opcode
  • 7 (rd[0]): is_outer
  • 11..8 (rd[4..1]): stagger_mask
  • 14..12 (rm): stagger_count
  • 19..15 (rs1): max_rep
  • 31..20 (imm12): max_inst
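The field mapping above can be expressed as a bit-packing sketch. The concrete opcode value is an assumption (the standard RISC-V custom-0 opcode, 0b0001011); the field placement follows the list.

```python
# Bit-level sketch of the frep encoding table above.
CUSTOM_0 = 0b0001011  # assumed opcode value; the proposal only says "custom"

def encode_frep(is_outer, stagger_mask, stagger_count, max_rep_reg, max_inst):
    assert stagger_mask < 16 and stagger_count < 8
    assert max_rep_reg < 32 and max_inst < 4096
    return (CUSTOM_0
            | (is_outer      << 7)    # rd[0]
            | (stagger_mask  << 8)    # rd[4..1]
            | (stagger_count << 12)   # rm
            | (max_rep_reg   << 15)   # rs1: register holding max_rep
            | (max_inst      << 20))  # imm12

def decode_frep(word):
    return dict(is_outer=(word >> 7) & 1,
                stagger_mask=(word >> 8) & 0xF,
                stagger_count=(word >> 12) & 0x7,
                max_rep_reg=(word >> 15) & 0x1F,
                max_inst=(word >> 20) & 0xFFF)

w = encode_frep(1, 0b1001, 3, 14, 8)  # max_rep held in a4 (x14)
print(decode_frep(w))
```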