CPU

The B32P3 is a 32-bit RISC CPU designed from scratch for the FPGC. It's the third iteration of the B32P design, optimized to run at 100 MHz on a Cyclone IV FPGA. The architecture follows a classic 5-stage MIPS-style pipeline.

Architecture Overview

The CPU has 16 general-purpose 32-bit registers (r0 is hardwired to zero), a 256-entry hardware stack, and a 32-bit word-addressable address space. It runs at a single clock frequency of 100 MHz with no clock gating or dynamic frequency scaling.

The pipeline has five stages:

  1. IF (Instruction Fetch): Reads the next instruction from ROM or L1I cache
  2. ID (Instruction Decode): Decodes instruction fields, reads register file
  3. EX (Execute): ALU operations, branch condition evaluation
  4. MEM (Memory Access): Load/store through L1D cache, VRAM, or I/O
  5. WB (Write Back): Writes results back to the register file

Under ideal conditions (cache hits, no hazards, no branches), the CPU executes one instruction per clock cycle. In practice, stalls from cache misses, multi-cycle ALU operations (such as division), and pipeline hazards reduce throughput.

Instruction Set

The ISA has 16 instructions, all 32 bits wide. There are no variable-length instructions or instruction modes. The opcode is always in the top 4 bits.

Instruction Encoding

         |31|30|29|28|27|26|25|24|23|22|21|20|19|18|17|16|15|14|13|12|11|10|09|08|07|06|05|04|03|02|01|00|
----------------------------------------------------------------------------------------------------------
 HALT      1  1  1  1| 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 READ      1  1  1  0||----------------16 BIT CONSTANT---------------||--A REG---| x  x  x  x |--D REG---|
 WRITE     1  1  0  1||----------------16 BIT CONSTANT---------------||--A REG---||--B REG---| x  x  x  x
 INTID     1  1  0  0| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--D REG---|
 PUSH      1  0  1  1| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--B REG---| x  x  x  x
 POP       1  0  1  0| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--D REG---|
 JUMP      1  0  0  1||--------------------------------27 BIT CONSTANT--------------------------------||O|
 JUMPR     1  0  0  0||----------------16 BIT CONSTANT---------------| x  x  x  x |--B REG---| x  x  x |O|
 CCACHE    0  1  1  1| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x
 BRANCH    0  1  1  0||----------------16 BIT CONSTANT---------------||--A REG---||--B REG---||-OPCODE||S|
 SAVPC     0  1  0  1| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--D REG---|
 RETI      0  1  0  0| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x
 ARITHMC   0  0  1  1||--OPCODE--||----------------16 BIT CONSTANT---------------||--A REG---||--D REG---|
 ARITHM    0  0  1  0||--OPCODE--| x  x  x  x  x  x  x  x  x  x  x  x |--A REG---||--B REG---||--D REG---|
 ARITHC    0  0  0  1||--OPCODE--||----------------16 BIT CONSTANT---------------||--A REG---||--D REG---|
 ARITH     0  0  0  0||--OPCODE--| x  x  x  x  x  x  x  x  x  x  x  x |--A REG---||--B REG---||--D REG---|
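
To make the layout above concrete, here is a minimal Python sketch of field extraction. The helper names `bits` and `decode` and the dictionary keys are hypothetical. The bit positions are read directly off the encoding table. The meaning of the `O` bit is not described in this section, so it is returned as a raw bit.

```python
# Instruction names indexed by the 4-bit opcode in bits 31..28 (from the table above).
INSTR_NAMES = [
    "ARITH", "ARITHC", "ARITHM", "ARITHMC",
    "RETI", "SAVPC", "BRANCH", "CCACHE",
    "JUMPR", "JUMP", "POP", "PUSH",
    "INTID", "WRITE", "READ", "HALT",
]


def bits(word, hi, lo):
    """Return word[hi:lo] (inclusive bit range) as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word):
    """Split a 32-bit instruction word into its raw fields (hypothetical helper)."""
    word &= 0xFFFFFFFF
    name = INSTR_NAMES[bits(word, 31, 28)]
    f = {"name": name}
    if name in ("ARITH", "ARITHM"):
        f.update(op=bits(word, 27, 24), a=bits(word, 11, 8),
                 b=bits(word, 7, 4), d=bits(word, 3, 0))
    elif name in ("ARITHC", "ARITHMC"):
        f.update(op=bits(word, 27, 24), const16=bits(word, 23, 8),
                 a=bits(word, 7, 4), d=bits(word, 3, 0))
    elif name == "READ":
        f.update(const16=bits(word, 27, 12), a=bits(word, 11, 8), d=bits(word, 3, 0))
    elif name == "WRITE":
        f.update(const16=bits(word, 27, 12), a=bits(word, 11, 8), b=bits(word, 7, 4))
    elif name == "BRANCH":
        f.update(const16=bits(word, 27, 12), a=bits(word, 11, 8),
                 b=bits(word, 7, 4), cond=bits(word, 3, 1), s=bits(word, 0, 0))
    elif name == "JUMP":
        f.update(const27=bits(word, 27, 1), o=bits(word, 0, 0))
    elif name == "JUMPR":
        f.update(const16=bits(word, 27, 12), b=bits(word, 7, 4), o=bits(word, 0, 0))
    elif name == "PUSH":
        f.update(b=bits(word, 7, 4))
    elif name in ("POP", "INTID", "SAVPC"):
        f.update(d=bits(word, 3, 0))
    # HALT, CCACHE and RETI carry no operand fields.
    return f
```

For example, `decode(0x13000512)` yields an ARITHC with ALU opcode 0011 (ADD), constant 5, A register r1 and D register r2. Note that the register-immediate formats (ARITHC, ARITHMC) place the A register in bits 7..4, while the register-register formats place it in bits 11..8.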

The instruction set is split into five categories:

Control flow: HALT, JUMP, JUMPR, BRANCH, SAVPC, RETI

Memory access: READ (load), WRITE (store), PUSH, POP

Arithmetic/Logic (single-cycle): ARITH and ARITHC use the combinational ALU. ARITHC takes a 16-bit immediate instead of a second register.

Arithmetic/Logic (multi-cycle): ARITHM and ARITHMC use the multi-cycle ALU for multiplication, division, and modulo. Division takes about 32 cycles.

Miscellaneous: INTID (get interrupt ID), CCACHE (clear all caches)

ALU Operations (Single-Cycle)

Opcode | Operation | Description
-------+-----------+----------------------------------------
0000   | OR        | Bitwise OR
0001   | AND       | Bitwise AND
0010   | XOR       | Bitwise XOR
0011   | ADD       | Addition
0100   | SUB       | Subtraction
0101   | SHIFTL    | Logical shift left
0110   | SHIFTR    | Logical shift right
0111   | NOT       | Bitwise NOT (of A)
1010   | SLT       | Set if A < B (signed)
1011   | SLTU      | Set if A < B (unsigned)
1100   | LOAD      | Load B (or constant)
1101   | LOADHI    | Load upper 16 bits: {const16, A[15:0]}
1110   | SHIFTRS   | Arithmetic shift right
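
The single-cycle operations can be modelled in a few lines of Python. This is a behavioural sketch, not the hardware implementation: the function names are hypothetical, and masking the shift amount to 5 bits is an assumption, since this section does not say how shifts of 32 or more behave.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit value as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, a, b):
    """Behavioural model of the single-cycle ALU; b is the B register or the constant."""
    a &= MASK32
    b &= MASK32
    sh = b & 31  # ASSUMPTION: only the low 5 bits of the shift amount are used
    if op == 0b0000:
        r = a | b
    elif op == 0b0001:
        r = a & b
    elif op == 0b0010:
        r = a ^ b
    elif op == 0b0011:
        r = a + b
    elif op == 0b0100:
        r = a - b
    elif op == 0b0101:
        r = a << sh
    elif op == 0b0110:
        r = a >> sh                      # logical: zeros shifted in
    elif op == 0b0111:
        r = ~a
    elif op == 0b1010:
        r = int(to_signed(a) < to_signed(b))
    elif op == 0b1011:
        r = int(a < b)
    elif op == 0b1100:
        r = b
    elif op == 0b1101:
        r = ((b & 0xFFFF) << 16) | (a & 0xFFFF)   # {const16, A[15:0]}
    elif op == 0b1110:
        r = to_signed(a) >> sh           # arithmetic: sign bit shifted in
    else:
        raise ValueError(f"opcode {op:04b} is not listed in the table")
    return r & MASK32
```

A common use of the last two table entries is building a 32-bit constant in two steps: LOAD the low 16 bits, then LOADHI the high 16 bits. In the model, `alu(0b1101, alu(0b1100, 0, 0x5678), 0x1234)` gives `0x12345678`.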

ALU Operations (Multi-Cycle)

Opcode | Operation | Description                    | Cycles
-------+-----------+--------------------------------+-------
0000   | MULTS     | Signed multiply                | ~4
0001   | MULTU     | Unsigned multiply              | ~4
0010   | MULTFP    | Fixed-point multiply (Q16.16)  | ~4
0011   | DIVS      | Signed divide                  | ~32
0100   | DIVU      | Unsigned divide                | ~32
0101   | DIVFP     | Fixed-point divide             | ~32
0110   | MODS      | Signed modulo                  | ~32
0111   | MODU      | Unsigned modulo                | ~32
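
As an illustration of the Q16.16 format used by MULTFP, the sketch below multiplies two fixed-point values in Python. The helper names are hypothetical. Treating the operands as signed and truncating the low 16 bits of the product are assumptions about the hardware, not statements taken from this page.

```python
MASK32 = 0xFFFFFFFF
FRAC_BITS = 16  # Q16.16: 16 integer bits, 16 fractional bits


def to_signed(x):
    """Interpret a 32-bit value as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def to_q16(value):
    """Encode a Python number as a 32-bit Q16.16 word."""
    return int(round(value * (1 << FRAC_BITS))) & MASK32


def from_q16(raw):
    """Decode a 32-bit Q16.16 word back to a Python float."""
    return to_signed(raw) / (1 << FRAC_BITS)


def multfp(a, b):
    """Q16.16 multiply: full-width product, then drop the extra 16 fraction bits.

    ASSUMPTION: signed operands, low bits discarded (no rounding).
    """
    product = to_signed(a) * to_signed(b)   # up to 64 bits wide
    return (product >> FRAC_BITS) & MASK32
```

Multiplying two Q16.16 numbers produces a result with 32 fractional bits, which is why the product is shifted right by 16 before being stored. For instance, `from_q16(multfp(to_q16(1.5), to_q16(2.0)))` evaluates to `3.0`.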

Branch Conditions

Opcode | Condition     | Signed variant (S=1)
-------+---------------+---------------------
000    | BEQ (A == B)  | Same
001    | BGT (A > B)   | BGTS
010    | BGE (A >= B)  | BGES
100    | BNE (A != B)  | Same
101    | BLT (A < B)   | BLTS
110    | BLE (A <= B)  | BLES
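
The effect of the S bit is easiest to see with operands that have the top bit set. The following Python sketch evaluates the conditions from the table. The function name `branch_taken` is hypothetical, and the model only covers the comparison, not how the branch target is formed from the 16-bit constant.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit value as a two's-complement signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def branch_taken(cond, s, a, b):
    """Evaluate a branch condition; s=1 selects the signed comparison."""
    a &= MASK32
    b &= MASK32
    if s:
        a, b = to_signed(a), to_signed(b)
    if cond == 0b000:
        return a == b   # BEQ
    if cond == 0b001:
        return a > b    # BGT / BGTS
    if cond == 0b010:
        return a >= b   # BGE / BGES
    if cond == 0b100:
        return a != b   # BNE
    if cond == 0b101:
        return a < b    # BLT / BLTS
    if cond == 0b110:
        return a <= b   # BLE / BLES
    raise ValueError(f"condition {cond:03b} is not listed in the table")
```

With `a = 0xFFFFFFFF` and `b = 1`, BGT is taken (4294967295 > 1 unsigned) while BGTS is not (-1 > 1 is false). Equality tests do not depend on signedness, which is why BEQ and BNE are marked "Same".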

Registers

16 registers, r0 hardwired to zero:

Register       | Notes
---------------+--------------------------------------------------------
r0             | Always zero. Writes are ignored.
r1 through r14 | General purpose
r15            | General purpose. Conventionally used for return values.

Hardware Stack

The CPU has a 256-entry hardware stack with dedicated PUSH and POP instructions. The stack is used primarily for saving/restoring registers during function calls and interrupt handlers. The stack pointer wraps around at 256, so pushing beyond that will overwrite old entries silently.
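
The wrap-around behaviour can be demonstrated with a small Python model. The class name is hypothetical, and two details are assumptions rather than facts from this page: that the pointer increments on PUSH, and that it points at the next free slot.

```python
class HardwareStack:
    """Behavioural model of a 256-entry stack with a wrapping pointer."""

    DEPTH = 256

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.sp = 0  # ASSUMPTION: points at the next free slot

    def push(self, value):
        self.mem[self.sp] = value & 0xFFFFFFFF
        self.sp = (self.sp + 1) % self.DEPTH   # wraps from 255 back to 0

    def pop(self):
        self.sp = (self.sp - 1) % self.DEPTH   # wraps from 0 back to 255
        return self.mem[self.sp]


stack = HardwareStack()
for i in range(257):          # one more push than the stack can hold
    stack.push(i)
popped = [stack.pop() for _ in range(256)]
# The first value pushed (0) was overwritten by the 257th push (256).
```

After the loop, `popped` starts with 256 and ends with 1, and the value 0 never appears: no error is raised, the oldest entry is simply gone. Deeply recursive code therefore needs to spill to SDRAM before the hardware stack fills.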

The stack pointer is readable and writable as a CPU-internal I/O register at 0x7C00001, which is useful for context switching or debugging.

Memory Map

All memory and I/O is mapped into a flat 27-bit address space. The CPU starts execution at the ROM address (0x7800000).

Address Range         | Region           | Size           | Description
----------------------+------------------+----------------+-----------------------------------------------------
0x0000000 - 0x6FFFFFF | SDRAM            | 112 MiW        | Main working memory, accessed through L1I/L1D caches
0x7000000 - 0x700001B | I/O              | 28 words       | UART, SPI, Timers, GPIO, etc.
0x7800000 - 0x78003FF | ROM              | 1 KiW          | Boot ROM (also the initial PC value)
0x7900000 - 0x790041F | VRAM32           | 32-bit entries | Tile patterns and palettes
0x7A00000 - 0x7A02001 | VRAM8            | 8-bit entries  | Tile maps, scroll registers
0x7B00000 - 0x7B12BFF | VRAMpixel        | 8-bit entries  | 320x240 pixel framebuffer (external SRAM)
0x7C00000 - 0x7C00001 | CPU Internal I/O | 2 words        | PC Backup (0x00), Stack Pointer (0x01)

SDRAM is the main working memory. It's accessed through L1 instruction (L1I) and data (L1D) caches, so most reads complete in a single cycle on cache hits. Only SDRAM and ROM can be used as instruction memory.

The VRAM regions are on-chip dual-port block RAM (VRAM32 and VRAM8) or external SRAM (VRAMpixel) and are accessed in a single cycle without caching. They are used by the GPU for rendering.

I/O devices are accessed through the Memory Unit, which is a separate module that handles SPI, UART, timers, and other peripherals. I/O accesses stall the pipeline until complete.
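
For tooling such as an emulator or an assembler sanity check, the memory map can be expressed as a simple lookup. The sketch below copies the address ranges from the table. The names `REGIONS`, `region_of` and `can_execute_from` are hypothetical.

```python
# (first address, last address, region name), copied from the memory map table.
REGIONS = [
    (0x0000000, 0x6FFFFFF, "SDRAM"),
    (0x7000000, 0x700001B, "I/O"),
    (0x7800000, 0x78003FF, "ROM"),
    (0x7900000, 0x790041F, "VRAM32"),
    (0x7A00000, 0x7A02001, "VRAM8"),
    (0x7B00000, 0x7B12BFF, "VRAMpixel"),
    (0x7C00000, 0x7C00001, "CPU Internal I/O"),
]


def region_of(addr):
    """Return the region name for a word address, or None if it is unmapped."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    return None


def can_execute_from(addr):
    """Only SDRAM and ROM can be used as instruction memory."""
    return region_of(addr) in ("SDRAM", "ROM")


def region_size(name):
    """Number of addressable entries in a region."""
    for first, last, region in REGIONS:
        if region == name:
            return last - first + 1
    raise KeyError(name)
```

As a consistency check, `region_size("VRAMpixel")` equals 320 * 240 = 76800, one entry per framebuffer pixel, and `region_size("SDRAM")` equals 112 * 2**20 words.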

Interrupts

The CPU supports 8 interrupt lines, priority-encoded (lower index = higher priority). Interrupts are edge-triggered and pass through clock domain crossing (CDC) synchronization.

When an interrupt fires:

  1. The current PC is saved to PC_backup (readable/writable at 0x7C00000)
  2. Interrupts are disabled (no nesting)
  3. PC jumps to address 0x0000001 (the interrupt handler)

The handler uses INTID to determine which interrupt fired, handles it, then executes RETI to restore the PC and re-enable interrupts.

An important constraint: interrupts only fire when a jump or branch is being taken in the MEM stage. This greatly simplifies pipeline hazard handling during interrupt delivery, at the cost of slightly delayed interrupt response. In practice, most code has enough jumps (function calls, loops) that the latency is negligible.
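
The interrupt sequence described above can be sketched as a small state machine in Python. This is a behavioural model under stated assumptions, not the pipeline logic: the class and method names are hypothetical, the value returned for INTID is assumed to be the line index, and the saved PC is assumed to be the address at which execution should resume.

```python
INTERRUPT_VECTOR = 0x0000001


class InterruptModel:
    """Behavioural model of interrupt delivery and RETI."""

    def __init__(self):
        self.pending = 0        # bit i set = an edge was latched on line i
        self.enabled = True
        self.pc_backup = 0
        self.current_id = None  # ASSUMPTION: what INTID would report

    def raise_line(self, line):
        """Latch an edge on interrupt line 0..7."""
        self.pending |= 1 << line

    def highest_priority(self):
        """Lower index = higher priority."""
        for line in range(8):
            if self.pending & (1 << line):
                return line
        return None

    def next_pc(self, resume_pc, jump_taken):
        """Return the next PC, diverting to the handler when allowed.

        Interrupts are only delivered while a jump or branch is being taken.
        """
        line = self.highest_priority()
        if not (self.enabled and jump_taken and line is not None):
            return resume_pc
        self.pending &= ~(1 << line)
        self.pc_backup = resume_pc   # ASSUMPTION: resume address is what gets saved
        self.enabled = False         # no nesting
        self.current_id = line
        return INTERRUPT_VECTOR

    def reti(self):
        """Restore the PC and re-enable interrupts."""
        self.enabled = True
        return self.pc_backup
```

In this model, a pending interrupt waits through any number of straight-line instructions (`jump_taken=False`) and is delivered on the first taken jump or branch. A second pending line stays latched until RETI re-enables interrupts and another jump is taken.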