CPU

The B32P3 is a 32-bit RISC CPU designed from scratch for the FPGC. It's the third iteration of the B32P design, optimized to run at 100 MHz on a Cyclone IV FPGA. The architecture follows a classic 5-stage MIPS-style pipeline.

Architecture Overview

The CPU has 16 general-purpose 32-bit registers (r0 is hardwired to zero), a 256-entry hardware stack, and a 32-bit word-addressable address space. It runs at a single clock frequency of 100 MHz with no clock gating or dynamic frequency scaling.

The pipeline has five stages:

IF (Instruction Fetch): Reads the next instruction from ROM or L1I cache
ID (Instruction Decode): Decodes instruction fields, reads register file
EX (Execute): ALU operations, branch condition evaluation
MEM (Memory Access): Load/store through L1D cache, VRAM, or I/O
WB (Write Back): Writes results back to the register file

Under ideal conditions (cache hits, no hazards, no branches), the CPU completes one instruction per clock cycle. In practice, stalls from cache misses, multi-cycle ALU operations (like division), and pipeline hazards reduce throughput below that peak.

Instruction Set

The ISA has 16 instructions, all 32 bits wide. There are no variable-length instructions or instruction modes. The opcode is always in the top 4 bits.

Instruction Encoding

         |31|30|29|28|27|26|25|24|23|22|21|20|19|18|17|16|15|14|13|12|11|10|09|08|07|06|05|04|03|02|01|00|
----------------------------------------------------------------------------------------------------------
 HALT      1  1  1  1| 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 READ      1  1  1  0||----------------16 BIT CONSTANT---------------||--A REG---| x  x  x  x |--D REG---|
 WRITE     1  1  0  1||----------------16 BIT CONSTANT---------------||--A REG---||--B REG---| x  x  x  x
 INTID     1  1  0  0| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--D REG---|
 PUSH      1  0  1  1| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--B REG---| x  x  x  x
 POP       1  0  1  0| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--D REG---|
 JUMP      1  0  0  1||--------------------------------27 BIT CONSTANT--------------------------------||O|
 JUMPR     1  0  0  0||----------------16 BIT CONSTANT---------------| x  x  x  x |--B REG---| x  x  x |O|
 CCACHE    0  1  1  1| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x
 BRANCH    0  1  1  0||----------------16 BIT CONSTANT---------------||--A REG---||--B REG---||-OPCODE||S|
 SAVPC     0  1  0  1| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x |--D REG---|
 RETI      0  1  0  0| x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x  x
 ARITHMC   0  0  1  1||--OPCODE--||----------------16 BIT CONSTANT---------------||--A REG---||--D REG---|
 ARITHM    0  0  1  0||--OPCODE--| x  x  x  x  x  x  x  x  x  x  x  x |--A REG---||--B REG---||--D REG---|
 ARITHC    0  0  0  1||--OPCODE--||----------------16 BIT CONSTANT---------------||--A REG---||--D REG---|
 ARITH     0  0  0  0||--OPCODE--| x  x  x  x  x  x  x  x  x  x  x  x |--A REG---||--B REG---||--D REG---|
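
As an illustration of how the table reads, the following is a minimal Python sketch of a field decoder. It is not part of the project's tooling: the function names and the dictionary keys (`alu_op`, `const16`, `cond`, `o`, and so on) are made up for this example, and constants are returned as raw bit patterns because the table does not say how they are sign- or zero-extended.

```python
# Illustrative decoder for the encoding table above (hypothetical helper names).
OPCODES = {
    0b1111: "HALT", 0b1110: "READ", 0b1101: "WRITE", 0b1100: "INTID",
    0b1011: "PUSH", 0b1010: "POP", 0b1001: "JUMP", 0b1000: "JUMPR",
    0b0111: "CCACHE", 0b0110: "BRANCH", 0b0101: "SAVPC", 0b0100: "RETI",
    0b0011: "ARITHMC", 0b0010: "ARITHM", 0b0001: "ARITHC", 0b0000: "ARITH",
}


def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word):
    """Split a 32-bit instruction word into the fields shown in the table."""
    word &= 0xFFFFFFFF
    name = OPCODES[bits(word, 31, 28)]  # opcode is always the top 4 bits
    f = {"op": name}
    if name in ("ARITH", "ARITHM"):
        f.update(alu_op=bits(word, 27, 24), a=bits(word, 11, 8),
                 b=bits(word, 7, 4), d=bits(word, 3, 0))
    elif name in ("ARITHC", "ARITHMC"):
        f.update(alu_op=bits(word, 27, 24), const16=bits(word, 23, 8),
                 a=bits(word, 7, 4), d=bits(word, 3, 0))
    elif name == "READ":
        f.update(const16=bits(word, 27, 12), a=bits(word, 11, 8),
                 d=bits(word, 3, 0))
    elif name == "WRITE":
        f.update(const16=bits(word, 27, 12), a=bits(word, 11, 8),
                 b=bits(word, 7, 4))
    elif name == "BRANCH":
        f.update(const16=bits(word, 27, 12), a=bits(word, 11, 8),
                 b=bits(word, 7, 4), cond=bits(word, 3, 1), s=bits(word, 0, 0))
    elif name == "JUMP":
        f.update(const27=bits(word, 27, 1), o=bits(word, 0, 0))
    elif name == "JUMPR":
        f.update(const16=bits(word, 27, 12), b=bits(word, 7, 4),
                 o=bits(word, 0, 0))
    elif name == "PUSH":
        f.update(b=bits(word, 7, 4))
    elif name in ("POP", "INTID", "SAVPC"):
        f.update(d=bits(word, 3, 0))
    # HALT, CCACHE and RETI carry no operand fields.
    return f
```

For example, `decode(0x13000512)` yields an ARITHC with ALU opcode `0011` (ADD), constant 5, A register 1, and D register 2.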

The instruction set is split into five categories:

Control flow: HALT, JUMP, JUMPR, BRANCH, SAVPC, RETI

Memory access: READ (load), WRITE (store), PUSH, POP

Arithmetic/Logic (single-cycle): ARITH and ARITHC use the combinational ALU. ARITHC takes a 16-bit immediate instead of a second register.

Arithmetic/Logic (multi-cycle): ARITHM and ARITHMC use the multi-cycle ALU for multiplication, division, and modulo. Division takes about 32 cycles.

Miscellaneous: INTID (get interrupt ID), CCACHE (clear all caches)

ALU Operations (Single-Cycle)

Opcode	Operation	Description
`0000`	OR	Bitwise OR
`0001`	AND	Bitwise AND
`0010`	XOR	Bitwise XOR
`0011`	ADD	Addition
`0100`	SUB	Subtraction
`0101`	SHIFTL	Logical shift left
`0110`	SHIFTR	Logical shift right
`0111`	NOT	Bitwise NOT (of A)
`1010`	SLT	Set if A < B (signed)
`1011`	SLTU	Set if A < B (unsigned)
`1100`	LOAD	Load B (or constant)
`1101`	LOADHI	Load upper 16 bits: `{const16, A[15:0]}`
`1110`	SHIFTRS	Arithmetic shift right
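
The table above can be read as a small reference model. The sketch below is an illustrative Python version, not the hardware implementation. Two points are assumptions: the shift amount is taken from the low 5 bits of B, and opcodes that do not appear in the table raise an error. For ARITHC, `b` stands for the 16-bit constant; how the hardware extends that constant is not modeled here.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def alu(op, a, b):
    """Reference model of the single-cycle ALU; returns a 32-bit pattern."""
    a &= MASK32
    b &= MASK32
    sh = b & 31  # assumption: only the low 5 bits select the shift amount
    if op == 0b0000:
        r = a | b
    elif op == 0b0001:
        r = a & b
    elif op == 0b0010:
        r = a ^ b
    elif op == 0b0011:
        r = a + b
    elif op == 0b0100:
        r = a - b
    elif op == 0b0101:
        r = a << sh
    elif op == 0b0110:
        r = a >> sh
    elif op == 0b0111:
        r = ~a
    elif op == 0b1010:
        r = int(to_signed(a) < to_signed(b))
    elif op == 0b1011:
        r = int(a < b)
    elif op == 0b1100:
        r = b
    elif op == 0b1101:
        r = ((b & 0xFFFF) << 16) | (a & 0xFFFF)  # {const16, A[15:0]}
    elif op == 0b1110:
        r = to_signed(a) >> sh  # Python's >> on negative ints is arithmetic
    else:
        raise ValueError(f"opcode {op:04b} is not listed in the table")
    return r & MASK32
```

A full 32-bit constant can be built in two steps: LOAD the lower half, then LOADHI the upper half. In this model, `alu(0b1101, alu(0b1100, 0, 0xBEEF), 0xDEAD)` gives `0xDEADBEEF`. Because LOADHI keeps only `A[15:0]`, the upper bits left behind by the first step do not affect the result.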

ALU Operations (Multi-Cycle)

Opcode	Operation	Description	Cycles
`0000`	MULTS	Signed multiply	~4
`0001`	MULTU	Unsigned multiply	~4
`0010`	MULTFP	Fixed-point multiply (Q16.16)	~4
`0011`	DIVS	Signed divide	~32
`0100`	DIVU	Unsigned divide	~32
`0101`	DIVFP	Fixed-point divide	~32
`0110`	MODS	Signed modulo	~32
`0111`	MODU	Unsigned modulo	~32
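
For readers unfamiliar with Q16.16, the sketch below shows the usual software definition of the format: 16 integer bits and 16 fractional bits, so the value 1.0 is stored as `1 << 16`. It assumes MULTFP and DIVFP follow this common convention; the exact rounding, overflow, and sign handling of the hardware are not specified here and are not modeled. The helper names are hypothetical.

```python
FRAC_BITS = 16
ONE = 1 << FRAC_BITS  # 1.0 in Q16.16


def to_q16(x):
    """Convert a Python float to a Q16.16 integer (round to nearest)."""
    return int(round(x * ONE))


def from_q16(q):
    """Convert a Q16.16 integer back to a float."""
    return q / ONE


def mul_q16(a, b):
    """Fixed-point multiply: the raw product has 32 fractional bits,
    so shift right by 16 to return to Q16.16."""
    return (a * b) >> FRAC_BITS


def div_q16(a, b):
    """Fixed-point divide: pre-shift the dividend by 16 so the quotient
    keeps 16 fractional bits. Shown for non-negative operands only."""
    return (a << FRAC_BITS) // b
```

With these helpers, `from_q16(mul_q16(to_q16(1.5), to_q16(2.0)))` evaluates to `3.0`, and `from_q16(div_q16(to_q16(3.0), to_q16(2.0)))` evaluates to `1.5`.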

Branch Conditions

Opcode	Condition	Signed variant (S=1)
`000`	BEQ (A == B)	Same
`001`	BGT (A > B)	BGTS
`010`	BGE (A >= B)	BGES
`100`	BNE (A != B)	Same
`101`	BLT (A < B)	BLTS
`110`	BLE (A <= B)	BLES
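
The effect of the S bit is easiest to see with a value such as `0xFFFFFFFF`, which is a large number when unsigned and -1 when signed. The following Python sketch evaluates the conditions from the table; it is an illustration only, the function name is hypothetical, and condition codes missing from the table raise an error rather than guessing at hardware behavior.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def branch_taken(cond, s, a, b):
    """Evaluate a branch condition; s=1 selects the signed comparison."""
    if s:
        a, b = to_signed(a), to_signed(b)
    else:
        a, b = a & MASK32, b & MASK32
    table = {
        0b000: a == b,  # BEQ
        0b001: a > b,   # BGT / BGTS
        0b010: a >= b,  # BGE / BGES
        0b100: a != b,  # BNE
        0b101: a < b,   # BLT / BLTS
        0b110: a <= b,  # BLE / BLES
    }
    if cond not in table:
        raise ValueError(f"condition {cond:03b} is not listed in the table")
    return table[cond]
```

Here `branch_taken(0b001, 0, 0xFFFFFFFF, 1)` is `True` (unsigned BGT), while `branch_taken(0b001, 1, 0xFFFFFFFF, 1)` is `False` (BGTS, since -1 is not greater than 1).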

Registers

16 registers, r0 hardwired to zero:

Register	Notes
r0	Always zero. Writes are ignored.
r1 through r14	General purpose
r15	General purpose. Conventionally used for return values.

Hardware Stack

The CPU has a 256-entry hardware stack with dedicated PUSH and POP instructions. The stack is used primarily for saving and restoring registers during function calls and interrupt handlers. The stack pointer wraps around at 256, so pushing more than 256 entries silently overwrites the oldest ones.

The stack pointer is readable and writable as a CPU-internal I/O register at 0x7C00001, which is useful for context switching or debugging.
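
The wraparound behavior can be sketched with a small Python model. This is illustrative only: whether the real stack pointer points at the next free slot or at the last pushed entry is not stated above, so the model assumes "next free slot", and the class name is hypothetical.

```python
class HardwareStack:
    """Toy model of a 256-entry stack whose pointer wraps silently."""

    DEPTH = 256

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.sp = 0  # assumption: sp indexes the next free slot

    def push(self, value):
        self.mem[self.sp] = value & 0xFFFFFFFF
        self.sp = (self.sp + 1) % self.DEPTH  # wraps at 256, no error raised

    def pop(self):
        self.sp = (self.sp - 1) % self.DEPTH
        return self.mem[self.sp]


stack = HardwareStack()
for i in range(257):  # one push more than the stack can hold
    stack.push(i)
```

After the 257 pushes, `stack.sp` is 1 and slot 0 holds 256: the first value pushed (0) has been overwritten without any error, so 257 pops would not return the original sequence.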

Memory Map

All memory and I/O is mapped into a flat 27-bit address space. The CPU starts execution at the ROM address (0x7800000).

Address Range	Region	Size	Description
`0x0000000` - `0x6FFFFFF`	SDRAM	112 MiW	Main working memory, accessed through L1I/L1D caches
`0x7000000` - `0x700001B`	I/O	28 words	UART, SPI, Timers, GPIO, etc.
`0x7800000` - `0x78003FF`	ROM	1 KiW	Boot ROM (also the initial PC value)
`0x7900000` - `0x790041F`	VRAM32	1056 x 32-bit entries	Tile patterns and palettes
`0x7A00000` - `0x7A02001`	VRAM8	8194 x 8-bit entries	Tile maps, scroll registers
`0x7B00000` - `0x7B12BFF`	VRAMpixel	76800 x 8-bit entries	320x240 pixel framebuffer (external SRAM)
`0x7C00000` - `0x7C00001`	CPU Internal I/O	2 words	PC Backup (`0x00`), Stack Pointer (`0x01`)

SDRAM is the main working memory. It's accessed through L1 instruction (L1I) and data (L1D) caches, so reads that hit the cache complete in a single cycle. Only SDRAM and ROM can be used as instruction memory.

The VRAM regions are on-chip dual-port block RAM (VRAM32 and VRAM8) or external SRAM (VRAMpixel) and are accessed in a single cycle without caching. They are used by the GPU for rendering.

I/O devices are accessed through the Memory Unit, which is a separate module that handles SPI, UART, timers, and other peripherals. I/O accesses stall the pipeline until complete.
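
As a quick way to check which region an address falls into, the memory map can be expressed as a lookup table. The Python sketch below is illustrative: the function names are hypothetical, and addresses outside every listed range simply return `None`, since the behavior of unmapped accesses is not described here.

```python
# (first address, last address, region name), taken from the memory map table
REGIONS = [
    (0x0000000, 0x6FFFFFF, "SDRAM"),
    (0x7000000, 0x700001B, "I/O"),
    (0x7800000, 0x78003FF, "ROM"),
    (0x7900000, 0x790041F, "VRAM32"),
    (0x7A00000, 0x7A02001, "VRAM8"),
    (0x7B00000, 0x7B12BFF, "VRAMpixel"),
    (0x7C00000, 0x7C00001, "CPU Internal I/O"),
]


def region_of(addr):
    """Return the name of the region containing addr, or None if unmapped."""
    for first, last, name in REGIONS:
        if first <= addr <= last:
            return name
    return None


def is_executable(addr):
    """Only SDRAM and ROM can be used as instruction memory."""
    return region_of(addr) in ("SDRAM", "ROM")
```

For example, `region_of(0x7800000)` returns `"ROM"` (the reset address), `region_of(0x7C00001)` returns `"CPU Internal I/O"` (the stack pointer register), and `is_executable(0x7900000)` is `False`.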

Interrupts

The CPU supports 8 interrupt lines, priority-encoded (lower index = higher priority). Interrupts are edge-triggered with CDC synchronization.

When an interrupt fires:

1. The current PC is saved to PC_backup (readable/writable at 0x7C00000)
2. Interrupts are disabled (no nesting)
3. PC jumps to address 0x0000001 (the interrupt handler)

The handler uses INTID to determine which interrupt fired, handles it, then executes RETI to restore the PC and re-enable interrupts.
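
The priority rule and the entry/exit sequence can be modeled in a few lines of Python. This is a behavioral sketch under stated assumptions, not the hardware logic: pending interrupts are represented as an 8-bit mask with bit i standing for line i, the names are hypothetical, and the numbering that INTID actually reports (0-based or 1-based) is not modeled.

```python
HANDLER_ADDR = 0x0000001


def highest_priority(pending):
    """Return the lowest set bit index of an 8-bit pending mask
    (lower index = higher priority), or None if nothing is pending."""
    pending &= 0xFF
    if pending == 0:
        return None
    return (pending & -pending).bit_length() - 1


class InterruptState:
    """Tracks PC, PC_backup and the interrupt-enable flag."""

    def __init__(self, pc):
        self.pc = pc
        self.pc_backup = 0
        self.enabled = True

    def fire(self):
        """Enter the handler; returns False if interrupts are disabled."""
        if not self.enabled:
            return False  # no nesting
        self.pc_backup = self.pc  # 1. save the current PC
        self.enabled = False      # 2. disable interrupts
        self.pc = HANDLER_ADDR    # 3. jump to the handler
        return True

    def reti(self):
        """Restore the PC and re-enable interrupts."""
        self.pc = self.pc_backup
        self.enabled = True
```

With lines 2 and 4 pending (`0b10100`), `highest_priority` returns 2. A second `fire()` before `reti()` returns `False`, which reflects the no-nesting rule.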

An important constraint: interrupts only fire when a jump or branch is being taken in the MEM stage. This greatly simplifies pipeline hazard handling during interrupt delivery, at the cost of slightly delayed interrupt response. In practice, most code has enough jumps (function calls, loops) that the latency is negligible.