Dual-Issue Snitch
This proposal makes Snitch and its FPU pseudo-dual-issue in a very lightweight manner. It is not truly dual-issue, since Snitch still issues only one instruction per cycle, but in-FPU instruction repetition (enabled by the regularity of SSR streams) lets the integer and float pipelines run in parallel.
The `frep` Instruction
The reduction loop in a matrix-vector or matrix-matrix product looks as follows (using SSRs, assuming 8 elements in the inner dimension, fully unrolled, no hardware loops, to emphasize the effect):
```
la a0, input_A
la a1, input_B
la a3, SSR_CFG
li a4, 16                    # number of outer iterations
loop_start:
sw a0, 0(a3)                 # configure SSR0, set data pointer
sw a1, 4(a3)                 # configure SSR1, set data pointer
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
addi a4, a4, -1              # adjust loop counter
add a0, a0, s0               # adjust address for outer dimension
add a1, a1, s0               # adjust address for outer dimension
bgtz a4, loop_start
```
That's 8 FPU ops and 6 integer ops per iteration: a peak FPU utilization of 57%, even with SSRs (without them it would be a dismal 36%, since every operand would need an explicit load). However, the FPU ops are very regular thanks to the SSRs. The FPU already has a register that holds the current operation; let's add a way to issue that operation into the pipeline multiple times:
```
la a0, input_A
la a1, input_B
la a3, SSR_CFG
li a4, 16
loop_start:
sw a0, 0(a3)
sw a1, 4(a3)
frep 8, 1                    # repeat next FPU inst 8x
fmadd fa0, ft0, ft1, fa0
addi a4, a4, -1
add a0, a0, s0
add a1, a1, s0
bgtz a4, loop_start
```
This now executes as follows (program and int pipeline left, FPU pipeline right):
```
Core                     | FPU
------------------------ | ------------------------
la a0, input_A           | <idle>
la a1, input_B           | <idle>
la a3, SSR_CFG           | <idle>
li a4, 16                | <idle>
sw a0, 0(a3)             | <idle>
sw a1, 4(a3)             | <idle>
frep 8, 1                | frep 8, 1
fmadd fa0, ft0, ft1, fa0 | fmadd fa0, ft0, ft1, fa0 # L0
addi a4, a4, -1          | fmadd fa0, ft0, ft1, fa0
add a0, a0, s0           | fmadd fa0, ft0, ft1, fa0
add a1, a1, s0           | fmadd fa0, ft0, ft1, fa0
bgtz a4, loop_start      | fmadd fa0, ft0, ft1, fa0
sw a0, 0(a3)             | fmadd fa0, ft0, ft1, fa0
sw a1, 4(a3)             | fmadd fa0, ft0, ft1, fa0
frep 8, 1                | fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0 | fmadd fa0, ft0, ft1, fa0 # L1
addi a4, a4, -1          | fmadd fa0, ft0, ft1, fa0
add a0, a0, s0           | fmadd fa0, ft0, ft1, fa0
add a1, a1, s0           | fmadd fa0, ft0, ft1, fa0
[...]                    | [...]
```
Notice how the int pipeline continues with the program while the FPU operates. FPU utilization is now at 100%. In fact, we believe we can reach 100% for many kernels, as long as they contain more float instructions than integer instructions.
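The utilization figures above follow from a simple model: without `frep`, every instruction occupies the single issue slot for a cycle; with `frep`, the integer instructions overlap with the repeated FPU instructions, so utilization is limited only by the float-to-int instruction ratio. A minimal sketch (instruction counts taken from the listings above; the model itself is an illustrative assumption, not part of the proposal):

```python
def fpu_utilization(n_float: int, n_int: int, frep: bool) -> float:
    """Fraction of cycles the FPU retires an op on a single-issue core.

    Without frep, int and float instructions share the one issue slot.
    With frep, repeated float ops issue from the FPU's own sequencer,
    so int ops are hidden behind them; the FPU only starves if the
    loop body has more int ops than float ops.
    """
    if not frep:
        return n_float / (n_float + n_int)
    return min(1.0, n_float / n_int)

# Loop body from the example: 8 fmadd, 6 integer instructions.
print(fpu_utilization(8, 6, frep=False))  # ~0.57, as in the text
print(fpu_utilization(8, 6, frep=True))   # 1.0: int ops fully hidden
```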
Register Staggering
Simply repeating the same instruction causes data hazards in the FPU pipeline. Consider an accumulation:
```
frep 4, 1                    # repeat next inst 4x
fmadd fa0, ft0, ft1, fa0
```
Assuming two cycles of latency for `fmadd`, this pipelines as follows:
```
fmadd fa0, ft0, ft1, fa0
<bubble>
<bubble>
fmadd fa0, ft0, ft1, fa0
<bubble>
<bubble>
fmadd fa0, ft0, ft1, fa0
<bubble>
<bubble>
fmadd fa0, ft0, ft1, fa0
```
Register staggering increments the register arguments with each iteration:
```
frep 4, 1, 0b1001            # repeat next inst 4x, inc rd and rs3
fmadd fa0, ft0, ft1, fa0
```
Unrolls to:
```
fmadd fa0, ft0, ft1, fa0
fmadd fa1, ft0, ft1, fa1
fmadd fa2, ft0, ft1, fa2
fmadd fa3, ft0, ft1, fa3
```
The accumulation now proceeds in four independent partial sums (fa0 through fa3), eliminating the hazards; the partial sums are reduced once after the loop.
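The staggering rule can be expressed as a small expansion function. In this sketch, the mask bit layout (bit 0 staggers rd, bit 1 rs1, bit 2 rs2, bit 3 rs3) and the wrap-around via a stagger maximum are assumptions inferred from the `0b1001` example above:

```python
def unroll_staggered(op, regs, reps, mask, stagger_max):
    """Expand one staggered frep-repeated instruction into its
    effective sequence.

    regs: (rd, rs1, rs2, rs3) base register indices.
    mask: 4-bit stagger mask; bit 0 staggers rd, bit 1 rs1,
          bit 2 rs2, bit 3 rs3 (assumed layout).
    The register offset cycles through 0 .. stagger_max-1.
    """
    out = []
    for i in range(reps):
        off = i % stagger_max
        rd, rs1, rs2, rs3 = (
            r + off if (mask >> b) & 1 else r
            for b, r in enumerate(regs)
        )
        out.append(f"{op} f{rd}, f{rs1}, f{rs2}, f{rs3}")
    return out

# frep 4, 1, 0b1001: repeat "fmadd fa0, ft0, ft1, fa0" four times,
# staggering rd and rs3. In the RISC-V ABI, fa0 is f10, ft0/ft1 are f0/f1.
for line in unroll_staggered("fmadd", (10, 0, 1, 10), 4, 0b1001, 4):
    print(line)
```

This yields the four-way staggered sequence from the listing above, with the destination and accumulator registers advancing from f10 to f13.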
Loop Buffer
Only the simplest kernels consist of a single FMA instruction in the innermost loop. The innermost iteration of the FFT kernel, for example, looks as follows:
```
fmul ft2, ft4, ft0
fmul ft3, ft5, ft0
fmsub ft2, ft5, ft0, ft2
fmadd ft3, ft4, ft0, ft3
fadd ft1, ft0, ft2
fsub ft1, ft0, ft2
fadd ft1, ft0, ft3
fsub ft1, ft0, ft3
```
We add an instruction buffer to the FPU, implemented as a latch-based ring buffer. A small counter in the FPU can then replay previously issued instructions. This enables microloops to run entirely in the FPU:
```
frep 256, 8                  # repeat the next 8 inst 256x
fmul ft2, ft4, ft0
fmul ft3, ft5, ft0
fmsub ft2, ft5, ft0, ft2
fmadd ft3, ft4, ft0, ft3
fadd ft1, ft0, ft2
fsub ft1, ft0, ft2
fadd ft1, ft0, ft3
fsub ft1, ft0, ft3
```
This loop then runs in the FPU pipeline for the next 2048 cycles (256 × 8), during which the int pipeline is free to compute the next FFT address pattern. This is a lot of time, and it can be put to good use computing the bit-reversed indices needed for the more interesting, memory-optimal in-place FFT.
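As a sketch of the kind of work the int pipeline can do in those free cycles, the following computes the bit-reversed index permutation used by an in-place radix-2 FFT. This is illustrative host code, not the Snitch kernel itself:

```python
def bit_reverse_permutation(n):
    """Index permutation for an in-place radix-2 FFT of size n
    (n must be a power of two)."""
    bits = n.bit_length() - 1

    def rev(i):
        # Reverse the low `bits` bits of i.
        r = 0
        for _ in range(bits):
            r = (r << 1) | (i & 1)
            i >>= 1
        return r

    return [rev(i) for i in range(n)]

print(bit_reverse_permutation(8))  # [0, 4, 2, 6, 1, 5, 3, 7]
```

Each element is swapped with the element at its bit-reversed index, which is what lets the FFT run in place with no auxiliary buffer.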
Hardware Cost
We have implemented this scheme as a block that sits between Snitch and the FPU. Sizes for different microloop depths:
- 8 instructions: 3.5 kGE
- 16 instructions: 5.6 kGE
Conclusion
This gives Snitch the same capability as NTX, with improved flexibility and programmability.
`frep` Instruction Encoding

Arguments that need to be encoded into `frep`:
- `is_outer`: 1 bit
- `max_inst`: immediate (up to 16 values)
- `max_rep`: register
- `stagger_mask`: 4 bit
- `stagger_count`: 3 bit
Mapping:

- `6..0`: custom opcode
- `7` (`rd[0]`): is_outer
- `11..8` (`rd[4..1]`): stagger_mask
- `14..12` (`rm`): stagger_count
- `19..15` (`rs1`): max_rep
- `31..20` (`imm12`): max_inst
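The field mapping can be made concrete with a small encoder. The custom opcode value below is a placeholder assumption (one of the RISC-V custom opcode slots); the field positions follow the mapping above:

```python
OPCODE_CUSTOM = 0b0001011  # placeholder: RISC-V custom-0 opcode slot


def encode_frep(is_outer, max_inst, max_rep_reg, stagger_mask, stagger_count):
    """Pack frep fields into a 32-bit instruction word per the mapping."""
    assert 0 <= max_inst < (1 << 12)   # imm12
    assert 0 <= max_rep_reg < 32       # rs1: register holding max_rep
    assert 0 <= stagger_mask < 16      # rd[4..1]
    assert 0 <= stagger_count < 8      # rm
    word = OPCODE_CUSTOM               # bits 6..0
    word |= (is_outer & 1) << 7        # bit 7: rd[0]
    word |= stagger_mask << 8          # bits 11..8
    word |= stagger_count << 12        # bits 14..12
    word |= max_rep_reg << 15          # bits 19..15
    word |= max_inst << 20             # bits 31..20
    return word


# frep over 8 instructions, repeat count held in a4 (x14), no staggering:
print(hex(encode_frep(0, 8, 14, 0, 0)))  # → 0x87000b
```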