DMA Engine

The FPGC includes a single-channel DMA controller, DMAengine, that moves data between SDRAM, SPI peripherals (SPI flash, SD card, and the ENC28J60 Ethernet controller), and the pixel framebuffer (VRAMPX) without involving the CPU pipeline.

It is implemented in Hardware/FPGA/Verilog/Modules/IO/DMAengine.v and instantiated from the top-level FPGC.v next to the SDRAM controller and the MemoryUnit.

Why

Three workloads dominated CPU cycles before the DMA existed:

Disk I/O: every BRFS sector read/write (512 bytes) was a tight per-byte SPI loop that stalled the CPU for milliseconds.
Ethernet packet RX/TX: packet payloads were copied between the ENC28J60 SRAM and SDRAM through SPI4.
Framebuffer presents: a full-frame memcpy of 76,800 bytes into VRAMPX cost ~18 ms via plain CPU stores and produced visible tearing because the GPU scans VRAMPX continuously.

The DMA engine offloads all three to dedicated hardware.

Register Block

The engine exposes a 6-register MMIO block in the I/O region. See Memory Map for the absolute addresses.

Offset	Name	R/W	Purpose
`0x70`	`DMA_SRC`	R/W	Source byte address (SDRAM for MEM2*, SPI for SPI2MEM)
`0x74`	`DMA_DST`	R/W	Destination byte address
`0x78`	`DMA_COUNT`	R/W	Byte count (must be > 0 and a multiple of 32)
`0x7C`	`DMA_CTRL`	R/W	Mode + flags + start (bit `[31]` is W1S start, self-clears)
`0x80`	`DMA_STATUS`	R	`{29'd0, sticky_error, sticky_done, busy}`
`0x84`	`DMA_QSPI_ADDR`	R/W	24-bit flash address for `SPI2MEM_QSPI` mode

DMA_CTRL layout:

Bits	Field	Meaning
`3:0`	`MODE`	0 = MEM2MEM, 1 = MEM2SPI, 2 = SPI2MEM, 3 = MEM2VRAM, 6 = SPI2MEM_QSPI
`4`	`IRQ_EN`	Raise interrupt 7 when the transfer completes
`7:5`	`SPI_ID`	SPI peripheral ID for SPI modes
`31`	`START` (W1S)	Writing 1 latches the registers and starts the engine
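
As a minimal sketch of the layout above, the helper below packs a DMA_CTRL value from its fields. The mode numbers and bit positions come from the table; the enum, macro, and function names are illustrative assumptions and are not taken from dma.h.

```c
/* Mode numbers as listed in the DMA_CTRL table.
 * All identifiers here are hypothetical, chosen for illustration. */
enum {
    DMA_MODE_MEM2MEM      = 0,
    DMA_MODE_MEM2SPI      = 1,
    DMA_MODE_SPI2MEM      = 2,
    DMA_MODE_MEM2VRAM     = 3,
    DMA_MODE_SPI2MEM_QSPI = 6
};

#define DMA_CTRL_IRQ_EN        (1u << 4)   /* bit 4      */
#define DMA_CTRL_SPI_ID_SHIFT  5           /* bits 7:5   */
#define DMA_CTRL_START         (1u << 31)  /* bit 31 W1S */

/* Pack MODE, IRQ_EN and SPI_ID into one word with START set.
 * Out-of-range field values are masked to the field width. */
unsigned int dma_ctrl_word(unsigned int mode, int irq_en, unsigned int spi_id)
{
    unsigned int ctrl = mode & 0xFu;

    if (irq_en)
        ctrl |= DMA_CTRL_IRQ_EN;
    ctrl |= (spi_id & 0x7u) << DMA_CTRL_SPI_ID_SHIFT;

    return ctrl | DMA_CTRL_START;
}
```

For example, an SPI2MEM transfer on SPI_ID 4 with the interrupt enabled packs to 0x80000092. Because START latches the other registers, DMA_SRC, DMA_DST, and DMA_COUNT should be written before this word.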

DMA_STATUS bits:

busy — high while the engine is transferring.
done — sticky; set when a transfer finishes successfully.
error — sticky; set on alignment violation or count == 0.

The sticky bits are cleared on the rising edge of a status read and when a new transfer is started.
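
Because a status read clears the sticky bits, it helps to read DMA_STATUS once and decode the snapshot. The sketch below does that decoding; the bit positions follow from the `{29'd0, sticky_error, sticky_done, busy}` concatenation (busy is bit 0), while the macro names, the function name, and the return-code convention are assumptions made for illustration.

```c
/* Bit positions derived from the DMA_STATUS concatenation order.
 * Names are hypothetical. */
#define DMA_STATUS_BUSY   (1u << 0)
#define DMA_STATUS_DONE   (1u << 1)
#define DMA_STATUS_ERROR  (1u << 2)

/* Classify one snapshot of DMA_STATUS.
 *   -1  the engine rejected or failed the transfer
 *    1  a transfer is still running
 *    0  a transfer finished successfully
 *    2  idle, nothing has completed since the sticky bits were cleared */
int dma_classify_status(unsigned int status)
{
    if (status & DMA_STATUS_ERROR)
        return -1;
    if (status & DMA_STATUS_BUSY)
        return 1;
    if (status & DMA_STATUS_DONE)
        return 0;
    return 2;
}
```

The caller passes in the value of a single register read, so the decode never triggers a second, bit-clearing read.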

Alignment Rules

Every transfer must satisfy:

DMA_SRC % 32 == 0
DMA_DST % 32 == 0
DMA_COUNT % 32 == 0 and DMA_COUNT > 0

For MEM2VRAM, DMA_DST must additionally lie inside the VRAMPX decode window (0x1EC00000 .. 0x1EC1FFFF), and DMA_DST + DMA_COUNT must not exceed it.

Misaligned values are rejected: the engine immediately enters the ERROR state, sets STATUS.error, and never touches memory.
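
The same rules can be checked in software before touching the registers, which avoids a round trip through the ERROR state. The following is a sketch under the rules stated in this section; the function and macro names are hypothetical, and it does not claim to mirror the checks in DMAengine.v line for line.

```c
#define DMA_ALIGN          32u
#define DMA_MODE_MEM2VRAM  3u
#define VRAMPX_FIRST       0x1EC00000u  /* first byte of the decode window */
#define VRAMPX_LAST        0x1EC1FFFFu  /* last byte of the decode window  */

/* Returns 1 if the arguments satisfy the alignment rules, 0 otherwise. */
int dma_args_ok(unsigned int mode, unsigned int src,
                unsigned int dst, unsigned int count)
{
    if (count == 0u)
        return 0;
    if ((src % DMA_ALIGN) != 0u || (dst % DMA_ALIGN) != 0u ||
        (count % DMA_ALIGN) != 0u)
        return 0;

    if (mode == DMA_MODE_MEM2VRAM) {
        if (dst < VRAMPX_FIRST || dst > VRAMPX_LAST)
            return 0;
        /* Compare against the remaining space instead of computing
         * dst + count, which could wrap around for large counts. */
        if (count - 1u > VRAMPX_LAST - dst)
            return 0;
    }
    return 1;
}
```

A full 76,800-byte frame at the start of the window passes, while a 64-byte transfer starting 32 bytes before the end of the window is rejected.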

Cache Coherency

The DMA engine reads SDRAM through the SDRAMarbiter's DMA port, which bypasses the CPU's L1 data cache. Software is responsible for cache coherency. Two rules:

Producer side (CPU writes the source, DMA reads it): invalidate/flush the L1d cache (ccached) before issuing the transfer, so any dirty lines are written back to SDRAM.
Consumer side (DMA writes the destination, CPU reads it): invalidate the L1d cache (ccached) after the transfer, so stale cached lines do not shadow the new data.

The libfpgc and userlib helpers (see below) bracket their synchronous transfers with ccached automatically. Asynchronous helpers leave this to the caller.

VRAMPX is not cached on the CPU side, so MEM2VRAM only needs the producer-side flush.

Modes

MEM2MEM

Plain SDRAM-to-SDRAM copy, in 32-byte cache-line bursts. Used as the fast memcpy primitive for large buffers.

MEM2SPI / SPI2MEM

Streams between SDRAM and a SPI peripheral via the SPI burst port, which feeds an internal TX/RX FIFO and drives SimpleSPI2. Used by the BRFS sector layer (SPI flash), the SD card driver, and the Ethernet driver (ENC28J60 packet RX/TX).

The target SPI peripheral is chosen by SPI_ID in DMA_CTRL; the driver must already have asserted its chip select (CS low) before starting the transfer. Supported SPI IDs:

SPI_ID	Peripheral	Notes
0	SPI Flash 0	SimpleSPI2 instance, reads and writes
1	SPI Flash 1	QSPIflash — reads via SPI2MEM_QSPI only
4	ENC28J60	SimpleSPI2 instance, reads and writes
5	SD card	SimpleSPI2 instance, reads and writes

SPI flash writes (page-program) are not DMA-accelerated on SPI1 (QSPIflash). The QSPIflash controller's 1-bit SPI burst path does not reliably handle the DMA engine's per-32-byte dma_select cycling between SDRAM reads and SPI pushes. spi_flash_write_words falls back to byte-by-byte spi_transfer for SPI1; this is not a bottleneck because page-program latency is dominated by the flash chip's internal program cycle (~1 ms), not bus bandwidth. SPI flash reads on SPI1 use the dedicated QSPI Fast Read DMA path (SPI2MEM_QSPI mode), which issues a single continuous burst without per-chunk select cycling.
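
The SPI_ID table and the note on SPI1 can be summarised as a small lookup from peripheral and direction to DMA mode. This is only an illustration of the rules described above: the identifier names are assumptions, and a return value of -1 stands for "no DMA path, use the byte-by-byte fallback".

```c
/* Mode and SPI_ID numbers come from the tables in this document;
 * the names are hypothetical. */
enum { DMA_MODE_MEM2SPI = 1, DMA_MODE_SPI2MEM = 2, DMA_MODE_SPI2MEM_QSPI = 6 };
enum { SPI_ID_FLASH0 = 0, SPI_ID_FLASH1 = 1, SPI_ID_ENC28J60 = 4, SPI_ID_SD = 5 };

/* Pick the DMA mode for a transfer, or -1 if the peripheral has no
 * DMA path in that direction. */
int dma_mode_for_spi(int spi_id, int is_write)
{
    switch (spi_id) {
    case SPI_ID_FLASH1:
        /* QSPIflash: reads go through the QSPI Fast Read mode,
         * page-program writes are not DMA-accelerated. */
        return is_write ? -1 : DMA_MODE_SPI2MEM_QSPI;
    case SPI_ID_FLASH0:
    case SPI_ID_ENC28J60:
    case SPI_ID_SD:
        /* SimpleSPI2 instances support both directions. */
        return is_write ? DMA_MODE_MEM2SPI : DMA_MODE_SPI2MEM;
    default:
        return -1;  /* not listed as a supported SPI_ID */
    }
}
```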

MEM2VRAM

Streams a 32-byte-aligned region of SDRAM into the VRAMPX write-port FIFO. The engine paces itself against the FIFO's full flag, so it cannot overrun the framebuffer SRAM. This is the primitive used for tear-free full-frame presents: software composes a frame in an SDRAM back buffer and blits it in one shot.
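
Since 320 is a multiple of 32, any run of whole rows in a 320-pixel-wide, one-byte-per-pixel frame satisfies the alignment rules on its own. The sketch below computes the transfer parameters for such a run. It assumes a row-major layout in both the back buffer and VRAMPX, which this document does not state explicitly, and the struct and function names are hypothetical.

```c
#define PIXEL_FB_ADDR  0x1EC00000u
#define FB_WIDTH       320u   /* bytes per row, assuming 1 byte per pixel */
#define FB_HEIGHT      240u

struct blit_span {
    unsigned int dst;         /* VRAMPX destination address             */
    unsigned int src_offset;  /* byte offset into the SDRAM back buffer */
    unsigned int count;       /* bytes to transfer                      */
};

/* Fill *out for `rows` whole rows starting at `first_row`.
 * Returns 0 on success, -1 if the range falls outside the frame. */
int fb_row_span(unsigned int first_row, unsigned int rows,
                struct blit_span *out)
{
    if (rows == 0u || first_row >= FB_HEIGHT ||
        rows > FB_HEIGHT - first_row)
        return -1;

    out->src_offset = first_row * FB_WIDTH;
    out->dst        = PIXEL_FB_ADDR + out->src_offset;
    out->count      = rows * FB_WIDTH;
    return 0;
}
```

The source address passed to the engine would be the 32-byte-aligned back buffer base plus src_offset. With first_row = 0 and rows = 240 this yields the full-frame case of 76,800 bytes, or 2,400 bursts of 32 bytes.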

SPI2MEM_QSPI (mode 6)

Dedicated QSPI Fast Read mode for SPI1 (QSPIflash controller). Instead of using the regular SPI burst path, this mode tells the QSPIflash hardware to issue opcode 0xEB (Quad I/O Fast Read) and stream data at 4× bandwidth directly into the DMA engine's RX FIFO.

The flash address is set in DMA_QSPI_ADDR (24-bit), not DMA_SRC. The engine issues one continuous burst for the entire transfer, then drains 32 bytes per SDRAM line into the destination.

This is the primary fast path for BRFS sector reads on SPI Flash 1.

Interrupt

When DMA_CTRL.IRQ_EN is set, the engine raises interrupt line 7 on completion (success or error). The handler should read DMA_STATUS to clear the sticky bits. See the Interrupt Assignments table.

C API

Both libfpgc (used by BDOS) and userlib (used by userBDOS programs) ship a dma.h with the same surface. The synchronous helpers are the easy path:

#include <dma.h>

/* Synchronous SDRAM-to-SDRAM copy; brackets with ccached on both
 * sides. Returns 0 on success, -1 on engine error. */
int dma_copy(unsigned int dst, unsigned int src, unsigned int count);

/* Synchronous SDRAM-to-VRAMPX blit. dst must be in 0x1EC00000..0x1EC1FFFF,
 * src 32-byte aligned in SDRAM, count a multiple of 32. Flushes the L1d
 * cache before the transfer; no post-invalidate needed (VRAMPX is
 * write-only from the CPU side). */
int dma_blit_to_vram(unsigned int dst, unsigned int src, unsigned int count);

For overlap with CPU work there are async equivalents:

void dma_start_mem2mem (unsigned int dst, unsigned int src, unsigned int count);
void dma_start_mem2vram(unsigned int dst, unsigned int src, unsigned int count);

int          dma_busy(void);    /* non-zero while STATUS.busy == 1 */
unsigned int dma_status(void);  /* one read; clears sticky bits     */

void cache_flush_data(void);    /* `ccached` wrapper                */

For SPI peripheral transfers (kernel-level, libfpgc only — not exposed to userBDOS programs):

/* Start a MEM2SPI or SPI2MEM transfer on the given SPI peripheral.
 * mode is DMA_MEM2SPI (1) or DMA_SPI2MEM (2).
 * The SPI peripheral must be selected (CS low) before calling.
 * Returns immediately — poll dma_busy() for completion. */
void dma_start_spi(int mode, int spi_id,
                   unsigned int dst, unsigned int src,
                   unsigned int count);

/* Start a QSPI Fast Read (mode 6) from SPI Flash 1.
 * Uses the QSPIflash controller's Quad I/O Fast Read path for 4× bandwidth.
 * qspi_addr is the 24-bit flash address (written to DMA_QSPI_ADDR).
 * dst is the SDRAM destination (32-byte aligned).
 * Returns immediately — poll dma_busy() for completion. */
void dma_start_spi_qspi_read(int spi_id,
                              unsigned int dst,
                              unsigned int qspi_addr,
                              unsigned int count);

The async path leaves cache management to the caller: call cache_flush_data() before starting the transfer if the CPU just wrote the source, and again once dma_busy() returns 0 if the CPU is about to read the destination.
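
The ordering can be sketched as a wrapper that overlaps CPU work with an SDRAM-to-SDRAM copy. The four functions in the first section are host-side stand-ins with the same signatures as the dma.h declarations listed above, added only so the sketch compiles off-target; on the FPGC they would be dropped in favour of `#include <dma.h>`. The wrapper name, the call log, and the callback parameter are assumptions for illustration.

```c
/* ---- Host-side stand-ins (assumption, not the real dma.h) ---- */
char call_log[16];          /* records 'F' (flush) and 'S' (start) */
int  call_count;
int  fake_busy_polls = 3;   /* pretend the engine is busy for 3 polls */

void log_call(char tag)
{
    if (call_count < 15)
        call_log[call_count++] = tag;
}

void cache_flush_data(void) { log_call('F'); }

void dma_start_mem2mem(unsigned int dst, unsigned int src, unsigned int count)
{
    (void)dst; (void)src; (void)count;
    log_call('S');
}

int dma_busy(void) { return fake_busy_polls-- > 0; }

unsigned int dma_status(void) { return 0x2u; /* done, no error */ }
/* ---- End of stand-ins ---- */

#define DMA_STATUS_ERROR (1u << 2)  /* bit 2 of DMA_STATUS */

/* Copy count bytes while running work() between busy polls.
 * work may be a null pointer. Returns 0 on success, -1 on engine error. */
int dma_copy_overlapped(unsigned int dst, unsigned int src,
                        unsigned int count, void (*work)(void))
{
    unsigned int status;

    cache_flush_data();                 /* producer side: write back source */
    dma_start_mem2mem(dst, src, count);

    while (dma_busy()) {
        if (work)
            work();                     /* must not touch src or dst ranges */
    }

    status = dma_status();              /* single read; clears sticky bits */
    cache_flush_data();                 /* consumer side: drop stale lines */

    return (status & DMA_STATUS_ERROR) ? -1 : 0;
}
```

The callback must stay away from the source and destination ranges while the transfer is in flight; otherwise the cache could be refilled with lines that the second flush is meant to discard.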

Typical pattern: tear-free framebuffer present

#include <syscall.h>
#include <dma.h>

#define PIXEL_FB_ADDR  0x1EC00000
#define W              320
#define H              240

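/* Application-specific placeholders, defined elsewhere. */
extern volatile int running;
void render_into(unsigned int buf);

unsigned int back_buf;   /* 32-byte aligned, holds a full frame */

int main(void) {
    /* Over-allocate by 32 bytes so the start can be rounded up
     * to the next 32-byte boundary. */
    unsigned char *raw = (unsigned char *)sys_heap_alloc(W * H + 32);
    back_buf = ((unsigned int)raw + 31u) & ~31u;

    while (running) {
        render_into(back_buf);                   /* CPU writes SDRAM */
        dma_blit_to_vram(PIXEL_FB_ADDR,
                         back_buf,
                         (unsigned int)(W * H)); /* atomic present  */
    }
    return 0;
}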