# Dual-Issue Snitch

This proposal makes Snitch and its FPU *pseudo-dual-issue* in a very lightweight manner. It is not truly dual-issue, since Snitch still issues only one instruction per cycle, but in-FPU instruction repetition (enabled by the regularity of SSRs) allows the integer and float pipelines to run in parallel.

## `frep` Instruction

The reduction in the matrix-vector and matrix-matrix products looks as follows (using SSRs, assuming 8 elements, inner dimension unrolled, no hardware loops, to emphasize the effect):

```asm
    la a0, input_A
    la a1, input_B
    la a3, SSR_CFG
    li a4, 16                   # number of outer iterations
loop_start:
    sw a0, 0(a3)                # configure SSR0, set data pointer
    sw a1, 4(a3)                # configure SSR1, set data pointer
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    fmadd fa0, ft0, ft1, fa0
    addi a4, a4, -1             # adjust loop counter
    add a0, a0, s0              # adjust address for outer dimension
    add a1, a1, s0              # adjust address for outer dimension
    bgtz a4, loop_start
```

That's 8 FPU ops and 6 integer ops per iteration: a peak FPU utilization of 57%, even with SSRs (it would be far worse without them -- 36%). However, the FPU ops are very regular thanks to the SSRs.
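The utilization arithmetic can be checked directly (a Python sketch; the one-explicit-load-per-`fmadd` figure for the no-SSR case is an assumption chosen to match the 36% quoted above):

```python
# FPU utilization = FPU issue slots / total issue slots per iteration.
fpu_ops = 8   # fmadd instructions in the loop body
int_ops = 6   # 2x sw, addi, 2x add, bgtz

with_ssr = fpu_ops / (fpu_ops + int_ops)
# Without SSRs, assume one explicit load per fmadd feeds the FPU.
without_ssr = fpu_ops / (fpu_ops + int_ops + fpu_ops)

print(f"{with_ssr:.0%}")     # 57%
print(f"{without_ssr:.0%}")  # 36%
```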
The FPU has a register to hold the current op anyway; let's add a way to sequence that issue into the pipeline multiple times:

```asm
    la a0, input_A
    la a1, input_B
    la a3, SSR_CFG
    li a4, 16
loop_start:
    sw a0, 0(a3)
    sw a1, 4(a3)
    frep 8, 1                   # repeat next FPU inst 8x
    fmadd fa0, ft0, ft1, fa0
    addi a4, a4, -1
    add a0, a0, s0
    add a1, a1, s0
    bgtz a4, loop_start
```

This now executes as follows (program and integer pipeline on the left, FPU pipeline on the right):

```
Core                     | FPU
------------------------ | ------------------------
la a0, input_A           |
la a1, input_B           |
la a3, SSR_CFG           |
li a4, 16                |
sw a0, 0(a3)             |
sw a1, 4(a3)             |
frep 8, 1                | frep 8, 1
fmadd fa0, ft0, ft1, fa0 | fmadd fa0, ft0, ft1, fa0   # L0
addi a4, a4, -1          | fmadd fa0, ft0, ft1, fa0
add a0, a0, s0           | fmadd fa0, ft0, ft1, fa0
add a1, a1, s0           | fmadd fa0, ft0, ft1, fa0
bgtz a4, loop_start      | fmadd fa0, ft0, ft1, fa0
sw a0, 0(a3)             | fmadd fa0, ft0, ft1, fa0
sw a1, 4(a3)             | fmadd fa0, ft0, ft1, fa0
frep 8, 1                | fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0 | fmadd fa0, ft0, ft1, fa0   # L1
addi a4, a4, -1          | fmadd fa0, ft0, ft1, fa0
add a0, a0, s0           | fmadd fa0, ft0, ft1, fa0
add a1, a1, s0           | fmadd fa0, ft0, ft1, fa0
[...]                    | [...]
```

Notice how the integer pipeline continues with the program while the FPU operates; FPU utilization is now at 100%. In fact, we believe we can reach 100% for many kernels, as long as the kernel has more float instructions than integer instructions.

### Register Staggering

Simply repeating the same instruction causes lots of data hazards in the FPU pipeline.
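The cost of such hazards can be estimated with a minimal in-order issue model (a Python sketch; the 2-cycle `fmadd` latency and the one-issue-per-cycle limit are assumptions of the model):

```python
# Sketch: issue cycles for a stream of FPU ops under a simple in-order
# model with a fixed result latency (2 cycles assumed, as for fmadd).
def issue_cycles(ops, latency=2):
    """ops: list of (dest, sources); returns the cycle each op issues."""
    ready = {}      # register -> cycle its value becomes available
    cycle = 0
    schedule = []
    for dest, srcs in ops:
        # stall until all source registers are ready (RAW hazard)
        cycle = max([cycle] + [ready.get(r, 0) for r in srcs])
        schedule.append(cycle)
        ready[dest] = cycle + latency
        cycle += 1  # at most one issue per cycle
    return schedule

# Repeating the same fmadd: each op reads the fa0 it also writes.
same = [("fa0", ("ft0", "ft1", "fa0"))] * 4
# Separate accumulators fa0..fa3: no back-to-back dependency.
spread = [(f"fa{i}", ("ft0", "ft1", f"fa{i}")) for i in range(4)]

print(issue_cycles(same))    # [0, 2, 4, 6] -- stalls every iteration
print(issue_cycles(spread))  # [0, 1, 2, 3] -- back-to-back issue
```

Spreading the accumulation across `fa0..fa3` removes the dependency chain, which is exactly what register staggering automates.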
For example, an accumulation:

```asm
frep 4, 1                       # repeat next inst 4x
fmadd fa0, ft0, ft1, fa0
```

Assuming two cycles of latency for `fmadd`, this pipelines as follows:

```asm
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
fmadd fa0, ft0, ft1, fa0
```

Each instruction reads the `fa0` written by the previous one, so the pipeline stalls for the full `fmadd` latency between iterations. Register staggering increments the register arguments with each iteration:

```asm
frep 4, 1, 0b1001               # repeat next inst 4x, inc rd and rs3
fmadd fa0, ft0, ft1, fa0
```

This unrolls to:

```asm
fmadd fa0, ft0, ft1, fa0
fmadd fa1, ft0, ft1, fa1
fmadd fa2, ft0, ft1, fa2
fmadd fa3, ft0, ft1, fa3
```

## Loop Buffer

Only the simplest kernels consist of a single FMA instruction in the innermost loop. The innermost iteration of the FFT kernel, for example, looks as follows:

```asm
fmul  ft2, ft4, ft0
fmul  ft3, ft5, ft0
fmsub ft2, ft5, ft0, ft2
fmadd ft3, ft4, ft0, ft3
fadd  ft1, ft0, ft2
fsub  ft1, ft0, ft2
fadd  ft1, ft0, ft3
fsub  ft1, ft0, ft3
```

We add an instruction buffer to the FPU, implemented as a latch-based ring buffer. A small counter in the FPU can then be used to repeat previous instructions. This enables **microloops** to run entirely in the FPU:

```asm
frep 256, 8                     # repeat the next 8 inst 256x
fmul  ft2, ft4, ft0
fmul  ft3, ft5, ft0
fmsub ft2, ft5, ft0, ft2
fmadd ft3, ft4, ft0, ft3
fadd  ft1, ft0, ft2
fsub  ft1, ft0, ft2
fadd  ft1, ft0, ft3
fsub  ft1, ft0, ft3
```

This loop then runs in the FPU pipeline for the next 2048 cycles (256 iterations of 8 instructions), during which the integer pipeline is free to compute the next FFT address pattern. That is a lot of time, and it can be put to good use implementing the bit-reversal needed for the more interesting, memory-optimal in-place FFT.

## Hardware Cost

We have implemented this scheme as a block that sits between Snitch and the FPU. Sizes for different microloop depths:

- **8** instructions: 3.5 kGE
- **16** instructions: 5.6 kGE

## Conclusion

This gives Snitch the same capability as NTX, with improved flexibility and programmability.
## `frep` Instruction Encoding

Arguments that need to be encoded into `frep`:

- `is_outer`: 1 bit
- `max_inst`: immediate (up to 16 values)
- `max_rep`: register
- `stagger_mask`: 4 bit
- `stagger_count`: 3 bit

Mapping:

- `6..0`: opcode `custom`
- `7` (rd[0]): `is_outer`
- `11..8` (rd[4..1]): `stagger_mask`
- `14..12` (rm): `stagger_count`
- `19..15` (rs1): `max_rep`
- `31..20` (imm12): `max_inst`
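The mapping above can be exercised with a small packer (a Python sketch; the concrete `custom` opcode value `0b0001011`, i.e. the RISC-V custom-0 space, and the example register assignment are assumptions):

```python
# Sketch: pack the frep fields into a 32-bit instruction word,
# following the field positions listed in the mapping above.
def encode_frep(max_inst, max_rep_rs1, stagger_count, stagger_mask,
                is_outer, opcode=0b0001011):  # custom-0 opcode assumed
    assert 0 <= max_inst < (1 << 12)   # imm12 field
    assert 0 <= max_rep_rs1 < 32       # rs1 register index
    assert 0 <= stagger_count < 8      # 3-bit rm field
    assert 0 <= stagger_mask < 16      # 4-bit mask in rd[4..1]
    word = opcode                      # bits 6..0
    word |= (is_outer & 1) << 7        # rd[0]
    word |= stagger_mask << 8          # rd[4..1] -> bits 11..8
    word |= stagger_count << 12        # rm       -> bits 14..12
    word |= max_rep_rs1 << 15          # rs1      -> bits 19..15
    word |= max_inst << 20             # imm12    -> bits 31..20
    return word

# "frep 8, 1" with the repeat count held in register a4 (x14):
print(hex(encode_frep(max_inst=1, max_rep_rs1=14, stagger_count=0,
                      stagger_mask=0, is_outer=0)))  # 0x17000b
```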