CPU
The B32P3 is a 32-bit RISC CPU designed from scratch for the FPGC. It's the third iteration of the B32P design, optimized to run at 100 MHz on a Cyclone IV FPGA. The architecture follows a classic 5-stage MIPS-style pipeline.
Architecture Overview
The CPU has 16 general-purpose 32-bit registers (r0 is hardwired to zero), a 256-entry hardware stack, and a byte-addressable address space. It runs at a single clock frequency of 100 MHz with no clock gating or dynamic frequency scaling.
The pipeline has five stages:
- IF (Instruction Fetch): Reads the next instruction from ROM or L1I cache
- ID (Instruction Decode): Decodes instruction fields, reads register file
- EX (Execute): ALU operations, branch condition evaluation
- MEM (Memory Access): Load/store through L1D cache, VRAM, or I/O
- WB (Write Back): Writes results back to the register file
In ideal conditions (cache hits, no hazards, no branches), the CPU executes one instruction per clock cycle. In practice, stalls from cache misses, multi-cycle ALU operations (like division), and pipeline hazards reduce the throughput.
Instruction Set
The ISA has 16 instructions, all 32 bits wide. There are no variable-length instructions or instruction modes. The opcode is always in the top 4 bits.
Instruction Encoding
|31|30|29|28|27|26|25|24|23|22|21|20|19|18|17|16|15|14|13|12|11|10|09|08|07|06|05|04|03|02|01|00|
----------------------------------------------------------------------------------------------------------
HALT 1 1 1 1| 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
READ 1 1 1 0||----------------16 BIT CONSTANT---------------||--A REG---||-RD SUBOP-||--D REG---|
WRITE 1 1 0 1||----------------16 BIT CONSTANT---------------||--A REG---||--B REG---||-WR SUBOP|
INTID 1 1 0 0| x x x x x x x x x x x x x x x x x x x x x x x x |--D REG---|
PUSH 1 0 1 1| x x x x x x x x x x x x x x x x x x x x |--B REG---| x x x x
POP 1 0 1 0| x x x x x x x x x x x x x x x x x x x x x x x x |--D REG---|
JUMP 1 0 0 1||--------------------------------27 BIT CONSTANT--------------------------------||O|
JUMPR 1 0 0 0||----------------16 BIT CONSTANT---------------| x x x x |--B REG---| x x x |O|
CCACHE 0 1 1 1| x x x x x x x x x x x x x x x x x x x x x x x x x x x x
BRANCH 0 1 1 0||----------------16 BIT CONSTANT---------------||--A REG---||--B REG---||-OPCODE||S|
SAVPC 0 1 0 1| x x x x x x x x x x x x x x x x x x x x x x x x |--D REG---|
RETI 0 1 0 0| x x x x x x x x x x x x x x x x x x x x x x x x x x x x
ARITHMC 0 0 1 1||--OPCODE--||----------------16 BIT CONSTANT---------------||--A REG---||--D REG---|
ARITHM 0 0 1 0||--OPCODE--| x x x x x x x x x x x x |--A REG---||--B REG---||--D REG---|
ARITHC 0 0 0 1||--OPCODE--||----------------16 BIT CONSTANT---------------||--A REG---||--D REG---|
ARITH 0 0 0 0||--OPCODE--| x x x x x x x x x x x x |--A REG---||--B REG---||--D REG---|
The instruction set is split into four categories:
Control flow: HALT, JUMP, JUMPR, BRANCH, SAVPC, RETI
Memory access: READ (load word/halfword/byte), WRITE (store word/halfword/byte), PUSH, POP
Arithmetic/Logic (single-cycle): ARITH and ARITHC use the combinational ALU. ARITHC takes a 16-bit immediate instead of a second register.
Arithmetic/Logic (multi-cycle): ARITHM and ARITHMC use the multi-cycle ALU for multiplication, division, and modulo. Division takes about 32 cycles.
Miscellaneous: INTID (get interrupt ID), CCACHE (clear all caches)
READ/WRITE Sub-Opcodes
The READ and WRITE instructions support sub-word access through a 4-bit sub-opcode field. When the sub-opcode is 0000, the instruction performs a standard 32-bit word operation (backward compatible). Other encodings select byte or halfword access:
READ sub-opcode (bits [7:4]):
| Sub-opcode | Mnemonic | Description |
|---|---|---|
0000 |
read |
Load 32-bit word |
0001 |
readb |
Load byte, sign-extend to 32 bits |
0101 |
readbu |
Load byte, zero-extend to 32 bits |
0010 |
readh |
Load halfword, sign-extend to 32 bits |
0110 |
readhu |
Load halfword, zero-extend to 32 bits |
WRITE sub-opcode (bits [3:0]):
| Sub-opcode | Mnemonic | Description |
|---|---|---|
0000 |
write |
Store 32-bit word |
0001 |
writeb |
Store byte (low 8 bits of register) |
0010 |
writeh |
Store halfword (low 16 bits of register) |
Word loads/stores require 4-byte alignment, halfword loads/stores require 2-byte alignment, and byte loads/stores have no alignment requirement. Unaligned accesses silently ignore the low bits (the hardware does not raise an exception).
ALU Operations (Single-Cycle)
| Opcode | Operation | Description |
|---|---|---|
0000 |
OR | Bitwise OR |
0001 |
AND | Bitwise AND |
0010 |
XOR | Bitwise XOR |
0011 |
ADD | Addition |
0100 |
SUB | Subtraction |
0101 |
SHIFTL | Logical shift left |
0110 |
SHIFTR | Logical shift right |
0111 |
NOT | Bitwise NOT (of A) |
1010 |
SLT | Set if A < B (signed) |
1011 |
SLTU | Set if A < B (unsigned) |
1100 |
LOAD | Load B (or constant) |
1101 |
LOADHI | Load upper 16 bits: {const16, A[15:0]} |
1110 |
SHIFTRS | Arithmetic shift right |
ALU Operations (Multi-Cycle)
| Opcode | Operation | Description | Cycles |
|---|---|---|---|
0000 |
MULTS | Signed multiply | ~4 |
0001 |
MULTU | Unsigned multiply | ~4 |
0010 |
MULTFP | Fixed-point multiply (Q16.16) | ~4 |
0011 |
DIVS | Signed divide | ~32 |
0100 |
DIVU | Unsigned divide | ~32 |
0101 |
DIVFP | Fixed-point divide | ~32 |
0110 |
MODS | Signed modulo | ~32 |
0111 |
MODU | Unsigned modulo | ~32 |
1000 |
FMUL | FP64 fixed-point multiply (Q32.32) | ~5 |
1001 |
FADD | FP64 fixed-point add (Q32.32) | 1 |
1010 |
FSUB | FP64 fixed-point subtract (Q32.32) | 1 |
1011 |
FLD | Load FP64 register from two CPU registers | 1 |
1100 |
FSTHI | Store FP64 register high word to CPU register | 1 |
1101 |
FSTLO | Store FP64 register low word to CPU register | 1 |
1110 |
MULSHI | Signed multiply, return upper 32 bits | ~4 |
1111 |
MULTUHI | Unsigned multiply, return upper 32 bits | ~4 |
The FP64 operations (1000–1101) use the FP64 coprocessor described below. MULSHI and MULTUHI return the upper 32 bits of a 64-bit multiply result, which is useful for implementing fixed-point arithmetic without the coprocessor.
Branch Conditions
| Opcode | Condition | Signed variant (S=1) |
|---|---|---|
000 |
BEQ (A == B) | Same |
001 |
BGT (A > B) | BGTS |
010 |
BGE (A >= B) | BGES |
100 |
BNE (A != B) | Same |
101 |
BLT (A < B) | BLTS |
110 |
BLE (A <= B) | BLES |
Registers
16 registers, r0 hardwired to zero:
| Register | Notes |
|---|---|
| r0 | Always zero. Writes are ignored. |
| r1 through r14 | General purpose |
| r15 | General purpose. Conventionally used for return values. |
Hardware Stack
The CPU has a 256-entry hardware stack with dedicated PUSH and POP instructions. The stack is used primarily for saving/restoring registers during function calls and interrupt handlers. The stack pointer wraps around at 256, so pushing beyond that will overwrite old entries silently.
The stack pointer is readable and writable as a CPU-internal I/O register at 0x1F000004, which is useful for context switching or debugging.
Memory Map
All memory and I/O is mapped into a flat byte-addressed space. The CPU starts execution at the ROM address (0x1E000000).
| Address Range | Region | Size | Description |
|---|---|---|---|
0x0000000 - 0x03FFFFFF |
SDRAM | 64 MiB | Main working memory, accessed through L1I/L1D caches |
0x1C000000 - 0x1C00006F |
I/O | 28 registers | UART, SPI, Timers, GPIO, etc. |
0x1E000000 - 0x1E000FFF |
ROM | 4 KiB (1 KiW) | Boot ROM (also the initial PC value) |
0x1E400000 - 0x1E40107C |
VRAM32 | 32-bit entries | Tile patterns and palettes |
0x1E800000 - 0x1E808004 |
VRAM8 | 8-bit entries | Tile maps, scroll registers |
0x1EC00000 - 0x1EC4AFFC |
VRAMpixel | 8-bit entries | 320x240 pixel framebuffer (external SRAM) |
0x1F000000 - 0x1F000004 |
CPU Internal I/O | 2 registers | PC Backup (0x00), Stack Pointer (0x04) |
SDRAM is the main working memory. It's accessed through L1 instruction (L1I) and data (L1D) caches, so most reads complete in a single cycle on cache hits. Only SDRAM and ROM can be used as instruction memory.
The VRAM regions are on-chip dual-port block RAM (VRAM32 and VRAM8) or external SRAM (VRAMpixel) and are accessed in a single cycle without caching. They are used by the GPU for rendering.
I/O devices are accessed through the Memory Unit, which is a separate module that handles SPI, UART, timers, and other peripherals. I/O accesses stall the pipeline until complete.
Interrupts
The CPU supports 8 interrupt lines, priority-encoded (lower index = higher priority). Interrupts are edge-triggered with CDC synchronization.
When an interrupt fires:
1. The current PC is saved to PC_backup (readable/writable at 0x1F000000)
2. Interrupts are disabled (no nesting)
3. PC jumps to address 0x0000004 (the interrupt handler, i.e., the second instruction)
The handler uses INTID to determine which interrupt fired, handles it, then executes RETI to restore the PC and re-enable interrupts.
Interrupt Assignments
| Bit | INT ID | Source | Description |
|---|---|---|---|
| 0 | 1 | UART RX | UART byte received |
| 1 | 2 | Timer 1 (OST1) | OS timer 1 |
| 2 | 3 | Timer 2 (OST2) | OS timer 2 (USB keyboard polling) |
| 3 | 4 | Timer 3 (OST3) | OS timer 3 (delay) |
| 4 | 5 | Frame Drawn | GPU vblank signal |
| 5 | 6 | ENC28J60 RX | Ethernet packet received (inverted ~INT pin) |
| 6 | 7 | (unused) | — |
| 7 | 8 | (unused) | — |
An important constraint: interrupts only fire when a jump or branch is being taken in the MEM stage. This greatly simplifies pipeline hazard handling during interrupt delivery, at the cost of slightly delayed interrupt response. In practice, most code has enough jumps (function calls, loops) that the latency is negligible.
FP64 Coprocessor
The FP64 coprocessor extends the B32P3 with 64-bit fixed-point arithmetic using a signed Q32.32 format. It has its own register file of 8 × 64-bit registers (f0–f7), separate from the CPU's general-purpose registers.
Q32.32 Fixed-Point Format
Each 64-bit FP register stores a value in signed Q32.32 format:
- hi (bits 63–32): Signed 32-bit integer part
- lo (bits 31–0): Unsigned 32-bit fractional part (representing a value in the range [0, 1))
The represented value is: value = hi + lo / 2^32
For example, to represent 3.75: hi = 3, lo = 0xC0000000 (0.75 × 2^32).
FP64 Register File
| Register | Width | Description |
|---|---|---|
| f0–f7 | 64 bits | General-purpose FP64 registers |
There is no hardwired zero register. All 8 registers are freely usable. The register index is encoded in 3 bits within the ARITHM instruction fields (A, B, and D register fields, using only the lower 3 bits).
FP64 Instructions
All FP64 instructions use the ARITHM encoding (opcode 0010). The alu_op field (bits 27–24) selects the operation. Register fields A, B, and D are reused: for FP64 operations, only the lower 3 bits index into the FP64 register file.
| Instruction | alu_op | Operands | Description |
|---|---|---|---|
| FMUL | 1000 |
fd, fa, fb | fd = fa × fb (Q32.32 multiply, ~5 cycles) |
| FADD | 1001 |
fd, fa, fb | fd = fa + fb (Q32.32 add, 1 cycle) |
| FSUB | 1010 |
fd, fa, fb | fd = fa - fb (Q32.32 subtract, 1 cycle) |
| FLD | 1011 |
fd, areg, breg | fd = {areg, breg} (load hi/lo from CPU registers) |
| FSTHI | 1100 |
dreg, fa | dreg = fa[63:32] (store hi word to CPU register) |
| FSTLO | 1101 |
dreg, fa | dreg = fa[31:0] (store lo word to CPU register) |
FLD loads an FP64 register from two CPU registers: areg provides the high (integer) word and breg provides the low (fractional) word. FSTHI and FSTLO extract the high or low 32-bit half of an FP64 register into a CPU register, which is the only way to move data from the FP64 register file back to the CPU.
FADD and FSUB complete in a single cycle. FMUL takes approximately 5 cycles because it uses DSP multiplier blocks and accumulates partial products. During multi-cycle FP64 operations, the pipeline stalls the same way as for regular multi-cycle ALU operations.