Architecture Internals¶
A deep dive into the gpuemu codebase for contributors. This page covers the internal structure of every component, the IPC protocol implementation, the validation pipeline, and the design decisions behind each subsystem.
Crate Structure¶
gpuemu is a Rust workspace with three crates, each with a distinct responsibility.
gpuemu-common¶
Shared library crate used by both the daemon and CLI. Contains no binaries.
crates/gpuemu-common/src/
├── lib.rs # Crate root, re-exports
├── types.rs # Core data types: TensorData, ValidationResult, FailureReason, etc.
├── protocol.rs # IPC message definitions: Request, Response, error codes
├── config.rs # GpuemuConfig, OpConfig, KernelConfig, ToleranceConfig parsing
└── rng.rs # Xorshift128+ RNG with Blake2b seed derivation
| File | Responsibility |
|---|---|
| types.rs | Defines TensorData (shape, strides, dtype, raw bytes), ValidationResult, FailureReason, FuzzConfig, ArtifactMetrics, and all other shared types. Uses serde for JSON serialization and rkyv for zero-copy storage. |
| protocol.rs | Defines the Request and Response enums that constitute the IPC wire protocol. Includes PROTOCOL_VERSION constant and all error code variants. |
| config.rs | Parses gpuemu.toml into GpuemuConfig. Handles merging of project, user, and daemon config files. Validates field values and provides defaults. |
| rng.rs | Implements xorshift128+ PRNG with a Blake2b-based seed derivation function. The same algorithm is implemented in Python (gpuemu/rng.py) for cross-language reproducibility. Given the same seed, both implementations produce bit-identical sequences. |
gpuemu-daemon¶
The daemon binary crate. Contains the server, validation engine, and all background processing logic.
crates/gpuemu-daemon/src/
├── main.rs # Daemon entry point, CLI args, tokio runtime setup
├── server.rs # NNG REP socket listener, request routing, response dispatch
├── validator.rs # Validation engine: shape/dtype/value/NaN/Inf/invariant checks
├── executor.rs # Reference script subprocess management
├── fuzzer.rs # Fuzz test case generation and orchestration
├── artifact.rs # PTX/SASS parsing, artifact linting, baseline diffing
└── storage.rs # sled database interface for results, failures, baselines, artifacts
| File | Responsibility |
|---|---|
| server.rs | Binds the NNG REP socket at the configured path (default ~/.gpuemu/gpuemu.sock). Runs an async event loop via Tokio that accepts requests, deserializes them, routes to the appropriate handler, and sends back serialized responses. Handles Ping, Shutdown, and all other request types. |
| validator.rs | The core validation engine. Accepts a TensorData (actual output) and a TensorData (reference output) and runs an ordered check pipeline: (1) shape match, (2) dtype match, (3) element-wise value comparison with per-dtype atol/rtol, (4) NaN detection, (5) Inf detection, (6) invariant enforcement. Returns a ValidationResult with pass/fail status, max absolute diff, max relative diff, and a list of FailureReason values. |
| executor.rs | Manages Python reference script execution. Spawns scripts as child processes via std::process::Command, writes JSON+base64 input to stdin, reads JSON+base64 output from stdout. Implements a configurable timeout (default 60 seconds). Captures stderr for error reporting. |
| fuzzer.rs | Generates randomized test cases from a FuzzConfig. Varies shapes (within configured min/max bounds), dtypes, memory layouts, and value distributions. Uses the xorshift128+ RNG from gpuemu-common for deterministic seed-based generation. Orchestrates batch execution via the executor and validator. Supports test case minimization via binary search on dimensions. |
| artifact.rs | Contains PtxParser and SassParser for extracting metrics from compiled GPU artifacts. PtxParser uses regex-based extraction to find register counts, spill counts, shared memory usage, local memory usage, and instruction counts from PTX text. SassParser (optional) invokes cuobjdump as a subprocess for SASS-level analysis. ArtifactLinter checks extracted metrics against policy thresholds. ArtifactDiffer compares current metrics against a stored baseline and flags regressions. |
| storage.rs | Wraps the sled embedded database. Provides typed get/put/list operations for validation results, failures, baselines, and artifact metrics. Uses rkyv for zero-copy serialization of stored values, making reads fast without deserialization overhead. Handles database creation, compaction, and error recovery. |
gpuemu¶
The CLI binary crate. Produces the user-facing gpuemu command.
crates/gpuemu/src/
├── main.rs # Entry point, clap argument parsing, subcommand dispatch
├── report.rs # Output formatting: text, JSON, JUnit XML
├── debug/
│ ├── mod.rs # Debug subcommand module
│ └── repl.rs # Interactive debug REPL for inspecting validation state
└── init/
    ├── mod.rs # Init subcommand module
    └── templates.rs # Scaffolding templates for gpuemu.toml and reference scripts
| File | Responsibility |
|---|---|
| main.rs | Defines the CLI structure using clap. Subcommands include daemon (start/stop/status/logs), test, fuzz, reproduce, minimize, failures, baseline, lint, ci, init, debug, version, and status. Each subcommand connects to the daemon via NNG REQ socket and sends the appropriate request. |
| report.rs | Formats validation results for output. Supports three modes: text (human-readable with colored pass/fail indicators), json (structured JSON for scripting), and junit (JUnit XML for CI platform integration). |
| debug/repl.rs | An interactive REPL for debugging validation issues. Allows inspecting stored results, replaying seeds, examining tensor values, and stepping through the validation pipeline. Useful for understanding why a specific test case fails. |
| init/templates.rs | Contains template strings for gpuemu init scaffolding. Generates a starter gpuemu.toml and example reference script based on the selected framework. |
IPC Layer¶
All communication between clients (CLI, Python, VS Code) and the daemon uses NNG (nanomsg-next-gen) sockets with JSON serialization.
Transport¶
| Property | Value |
|---|---|
| Socket type | REP/REQ (synchronous request-response) |
| Transport | Unix domain socket via NNG IPC (ipc:// scheme) |
| Default path | ~/.gpuemu/gpuemu.sock |
| Serialization | JSON (serde_json in Rust, json in Python) |
| Protocol version | 1 (PROTOCOL_VERSION in protocol.rs) |
Request-Response Flow¶
sequenceDiagram
participant Client as Client (REQ)
participant Server as Daemon (REP)
Client->>Server: JSON request bytes
Server->>Server: Deserialize Request enum
Server->>Server: Route to handler
Server->>Server: Execute (validate, fuzz, etc.)
Server->>Client: JSON response bytes
The protocol is strictly synchronous: one request produces one response. The NNG REP/REQ pattern enforces this at the socket level. Concurrent clients are serialized by the daemon's event loop.
Message Definitions¶
All messages are defined as Rust enums in protocol.rs and serialized as JSON with a type discriminator field:
// Simplified from protocol.rs
use std::collections::HashMap;

use serde::{Deserialize, Serialize};
use serde_json::Value; // assumed to be the type behind `kwargs`

/// Messages sent from a client to the daemon.
#[derive(Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum Request {
    Ping,
    Shutdown,
    ValidateOp {
        op_name: String,
        inputs: HashMap<String, TensorData>,
        output: TensorData,
        kwargs: Value,
    },
    FuzzOp { op_name: String, config: FuzzConfig },
    Reproduce { op_name: String, seed: u64 },
    Minimize { op_name: String, seed: u64 },
    // ... additional variants
}

/// Messages sent from the daemon back to a client.
#[derive(Serialize, Deserialize)]
#[serde(tag = "type")]
pub enum Response {
    Pong { version: String, protocol_version: u32, uptime_seconds: u64 },
    Ok,
    ValidationResult { result: ValidationResult },
    Error { code: String, message: String },
    // ... additional variants
}
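With `#[serde(tag = "type")]`, serde uses its internally tagged representation: the variant name goes into a `type` field next to the variant's own fields. The sketch below hand-writes that JSON shape using only the standard library, so the wire form is visible without pulling in serde. The `to_wire` and `json_string` helpers are hypothetical and exist only for illustration; the real crate relies on serde_json.

```rust
/// A cut-down copy of the `Request` enum, limited to three variants.
pub enum Request {
    Ping,
    Shutdown,
    Reproduce { op_name: String, seed: u64 },
}

/// Minimal JSON string escaping (quotes, backslashes, control characters).
fn json_string(s: &str) -> String {
    let mut out = String::from("\"");
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Hypothetical helper: render a request the way an internally tagged
/// serde enum would appear on the wire.
pub fn to_wire(req: &Request) -> String {
    match req {
        Request::Ping => r#"{"type":"Ping"}"#.to_string(),
        Request::Shutdown => r#"{"type":"Shutdown"}"#.to_string(),
        Request::Reproduce { op_name, seed } => format!(
            r#"{{"type":"Reproduce","op_name":{},"seed":{}}}"#,
            json_string(op_name),
            seed
        ),
    }
}
```

A unit variant such as `Ping` therefore travels as an object containing only the `type` field, which is what a Python or TypeScript client would need to emit.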
Version Negotiation¶
Clients send a Ping request on first connection. The daemon responds with Pong containing the protocol version. If the versions do not match, the client raises a ClientError with instructions to upgrade.
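The version check described above might look roughly like the following. This is a sketch only: the `ClientError` struct is a stand-in for the client's real error type, and the rule for which side should upgrade is an assumption, not something the protocol definition states.

```rust
/// Protocol version this client speaks (the transport table lists version 1).
pub const PROTOCOL_VERSION: u32 = 1;

/// Hypothetical stand-in for the client's `ClientError`.
#[derive(Debug, PartialEq)]
pub struct ClientError(pub String);

/// Compare the `protocol_version` reported in `Pong` with the client's own.
pub fn check_protocol_version(daemon_version: u32) -> Result<(), ClientError> {
    if daemon_version == PROTOCOL_VERSION {
        return Ok(());
    }
    // Assumption: whichever side reports the lower version should upgrade.
    let hint = if daemon_version < PROTOCOL_VERSION {
        "upgrade the daemon"
    } else {
        "upgrade the client"
    };
    Err(ClientError(format!(
        "protocol version mismatch: client speaks {}, daemon speaks {}; {}",
        PROTOCOL_VERSION, daemon_version, hint
    )))
}
```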
Storage¶
sled Database¶
gpuemu uses sled, an embedded key-value database written in Rust. The database lives at ~/.gpuemu/db/ and is opened when the daemon starts.
Data Organization¶
Data is organized into logical trees (namespaces) within the sled database:
| Tree | Key | Value | Purpose |
|---|---|---|---|
| results | {op_name}:{seed} | ValidationResult (rkyv) | Stores every validation result |
| failures | {op_name}:{seed}:{timestamp} | ValidationResult (rkyv) | Index of failed validations for quick lookup |
| baselines | {tag}:{op_name} | ValidationResult (rkyv) | Named baseline snapshots for regression detection |
| artifacts | {kernel_name}:{timestamp} | ArtifactMetrics (rkyv) | PTX/SASS metrics for each kernel |
| artifact_baselines | {tag}:{kernel_name} | ArtifactMetrics (rkyv) | Named artifact baselines for regression detection |
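To make the key layout concrete, here is a small sketch of helpers that build keys in the formats listed in the table. The function names are hypothetical, and two details are assumptions: the seed is rendered in decimal, and the timestamp is an integer. Because `:` is the separator, the sketch also assumes op names and tags never contain `:`.

```rust
/// Key for the `results` tree: `{op_name}:{seed}`.
pub fn results_key(op_name: &str, seed: u64) -> String {
    format!("{}:{}", op_name, seed)
}

/// Key for the `failures` tree: `{op_name}:{seed}:{timestamp}`.
/// Assumption: `timestamp` is an integer such as Unix seconds.
pub fn failures_key(op_name: &str, seed: u64, timestamp: u64) -> String {
    format!("{}:{}:{}", op_name, seed, timestamp)
}

/// Key for the `baselines` tree: `{tag}:{op_name}`.
pub fn baselines_key(tag: &str, op_name: &str) -> String {
    format!("{}:{}", tag, op_name)
}

/// Prefix shared by every `failures` key of one op. The trailing `:` keeps
/// `softmax` from also matching `softmax_v2`.
pub fn failures_prefix(op_name: &str) -> String {
    format!("{}:", op_name)
}
```

Putting the op name first means all entries for one op share a common prefix, which suits prefix scans over an ordered key-value store.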
Serialization¶
Values stored in sled use rkyv for zero-copy deserialization. This means reading a stored ValidationResult does not require parsing or allocation -- the bytes in the database are directly interpretable as a Rust struct. This makes bulk operations (listing results, scanning failures) fast.
The IPC wire protocol uses JSON (via serde_json) because it needs to be readable by Python and TypeScript clients. The storage layer uses rkyv because it only needs to be readable by Rust.
Validation Pipeline¶
The validation engine in validator.rs runs a fixed sequence of checks. Each check can produce a FailureReason that is collected into the final ValidationResult.
graph TD
A[Receive actual + reference TensorData] --> B{Shape match?}
B -->|No| Z[FAIL: ShapeMismatch]
B -->|Yes| C{Dtype match?}
C -->|No| Z2[FAIL: DtypeMismatch]
C -->|Yes| D[Element-wise comparison]
D --> E{Within atol + rtol * abs_expected?}
E -->|No| Z3[FAIL: ToleranceExceeded]
E -->|Yes| F{NaN check enabled?}
F -->|Yes| G{Any NaN in output?}
G -->|Yes| Z4[FAIL: NaNDetected]
G -->|No| H{Inf check enabled?}
F -->|No| H
H -->|Yes| I{Any Inf in output?}
I -->|Yes| Z5[FAIL: InfDetected]
I -->|No| J{Invariants}
H -->|No| J
J --> K{All invariants pass?}
K -->|No| Z6[FAIL: InvariantViolation]
K -->|Yes| L[PASS]
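The element-wise step in the diagram tests each value against `atol + rtol * abs_expected`. A minimal sketch of that comparison over flat `f64` slices follows. `DiffStats` and `compare_values` are hypothetical names, and dividing by `|expected|` to get the relative difference is an assumption about how the real validator reports it.

```rust
/// Summary of an element-wise comparison (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct DiffStats {
    pub max_abs_diff: f64,
    pub max_rel_diff: f64,
    pub within_tolerance: bool,
}

/// Compare `actual` against `expected` using |a - e| <= atol + rtol * |e|.
pub fn compare_values(actual: &[f64], expected: &[f64], atol: f64, rtol: f64) -> DiffStats {
    assert_eq!(actual.len(), expected.len(), "the shape check must pass first");
    let mut stats = DiffStats { max_abs_diff: 0.0, max_rel_diff: 0.0, within_tolerance: true };
    for (&a, &e) in actual.iter().zip(expected) {
        let abs_diff = (a - e).abs();
        // Written as a negated `<=` on purpose: any comparison involving NaN
        // is false, so a NaN on either side counts as out of tolerance.
        if !(abs_diff <= atol + rtol * e.abs()) {
            stats.within_tolerance = false;
        }
        // f64::max ignores a NaN argument, so the maxima stay finite.
        stats.max_abs_diff = stats.max_abs_diff.max(abs_diff);
        if e != 0.0 {
            stats.max_rel_diff = stats.max_rel_diff.max(abs_diff / e.abs());
        }
    }
    stats
}
```

In this sketch a NaN fails the tolerance test outright. The real validator may treat non-finite values differently so that they reach the dedicated NaN and Inf checks.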
Executor Detail¶
The Executor in executor.rs manages the reference script subprocess lifecycle:
- Serialize inputs: Convert HashMap<String, TensorData> to JSON with base64-encoded tensor bytes.
- Spawn process: std::process::Command::new("python3").arg(script_path).stdin(Stdio::piped()).stdout(Stdio::piped()).stderr(Stdio::piped())
- Write stdin: Pipe the JSON input to the child process stdin.
- Read stdout: Read the child's stdout, parse as JSON, decode base64 tensor data back into TensorData.
- Timeout: If the child does not exit within the configured timeout, kill it and return ReferenceScriptFailed.
- Error handling: If the exit code is non-zero, return ReferenceScriptFailed with the captured stderr.
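The subprocess lifecycle above can be sketched with the standard library alone. This is not the actual executor: `run_reference` is a hypothetical function, the interpreter is passed in as `program` rather than hard-coded, JSON and base64 handling are left out, and the timeout is enforced by polling `try_wait`. Pipes are serviced on helper threads so that a large payload cannot stall the parent.

```rust
use std::io::{Read, Write};
use std::process::{Command, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for the `ReferenceScriptFailed` error mentioned in the text.
#[derive(Debug)]
pub struct ReferenceScriptFailed(pub String);

/// Run `program script`, feed `input` on stdin, return the raw stdout bytes.
pub fn run_reference(
    program: &str,
    script: &str,
    input: &[u8],
    timeout: Duration,
) -> Result<Vec<u8>, ReferenceScriptFailed> {
    let mut child = Command::new(program)
        .arg(script)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()
        .map_err(|e| ReferenceScriptFailed(format!("spawn failed: {}", e)))?;

    // Write stdin on a helper thread; the pipe closes when `stdin` is dropped.
    let mut stdin = child.stdin.take().expect("stdin was piped");
    let payload = input.to_vec();
    let writer = thread::spawn(move || {
        let _ = stdin.write_all(&payload);
    });

    // Drain stdout and stderr concurrently so the child never blocks on a full pipe.
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let out_reader = thread::spawn(move || {
        let mut buf = Vec::new();
        let _ = stdout.read_to_end(&mut buf);
        buf
    });
    let mut stderr = child.stderr.take().expect("stderr was piped");
    let err_reader = thread::spawn(move || {
        let mut buf = String::new();
        let _ = stderr.read_to_string(&mut buf);
        buf
    });

    // Poll for exit until the deadline passes, then kill the child.
    let deadline = Instant::now() + timeout;
    let status = loop {
        match child.try_wait() {
            Ok(Some(status)) => break status,
            Ok(None) if Instant::now() >= deadline => {
                let _ = child.kill();
                let _ = child.wait();
                return Err(ReferenceScriptFailed("timed out".to_string()));
            }
            Ok(None) => thread::sleep(Duration::from_millis(10)),
            Err(e) => return Err(ReferenceScriptFailed(format!("wait failed: {}", e))),
        }
    };

    let _ = writer.join();
    let stdout_bytes = out_reader.join().unwrap_or_default();
    let stderr_text = err_reader.join().unwrap_or_default();
    if status.success() {
        Ok(stdout_bytes)
    } else {
        Err(ReferenceScriptFailed(stderr_text))
    }
}
```

Mapping a spawn failure to `ReferenceScriptFailed` is an assumption; the text only specifies that error for timeouts and non-zero exits.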
Fuzzer¶
The fuzzer in fuzzer.rs generates test cases by systematically varying input parameters.
FuzzConfig¶
The FuzzConfig struct controls what the fuzzer varies:
pub struct FuzzConfig {
    pub iterations: usize,     // Number of test cases to generate
    pub min_shape: Vec<usize>, // Minimum dimensions (e.g., [1, 1])
    pub max_shape: Vec<usize>, // Maximum dimensions (e.g., [256, 512])
    pub dtypes: Vec<String>,   // Dtypes to test
    pub layouts: Vec<Layout>,  // Memory layouts: Contiguous, Strided, Transposed
    pub seed: Option<u64>,     // Base seed (None = random)
}
RNG and Reproducibility¶
The fuzzer uses xorshift128+ as its core PRNG, seeded via Blake2b hash derivation. This design enables two critical properties:
- Deterministic: Given the same seed, the fuzzer produces identical test cases on every run.
- Cross-language: The same algorithm is implemented in both Rust (gpuemu-common/src/rng.rs) and Python (gpuemu/rng.py), producing bit-identical sequences. This means a failure seed from the CLI can be reproduced in Python, and vice versa.
The seed derivation chain:
base_seed (u64)
-> Blake2b(base_seed || iteration_index)
-> 128-bit state
-> xorshift128+ stream
-> shape values, dtype selection, tensor element values
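The chain above can be illustrated with a std-only sketch. Two caveats apply. First, Blake2b is not in the Rust standard library, so SplitMix64 is swapped in here purely as a placeholder for the seed derivation; it will not reproduce gpuemu's sequences. Second, the shift constants (23, 18, 5) are those of a commonly published xorshift128+ variant and are assumed, not taken from rng.rs.

```rust
/// xorshift128+ generator with a 128-bit state.
pub struct Xorshift128Plus {
    s0: u64,
    s1: u64,
}

/// SplitMix64 step. Placeholder for the Blake2b derivation, NOT the real one.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

impl Xorshift128Plus {
    /// Derive a per-iteration state from `base_seed` and `iteration`.
    pub fn derive(base_seed: u64, iteration: u64) -> Self {
        let mut mix = base_seed ^ iteration.wrapping_mul(0x9E37_79B9_7F4A_7C15);
        let s0 = splitmix64(&mut mix);
        let s1 = splitmix64(&mut mix);
        // The all-zero state is a fixed point of xorshift and must be avoided.
        if s0 == 0 && s1 == 0 {
            Self { s0: 1, s1: 0 }
        } else {
            Self { s0, s1 }
        }
    }

    /// Advance the state and return the next 64-bit output.
    pub fn next_u64(&mut self) -> u64 {
        let mut s1 = self.s0;
        let s0 = self.s1;
        let result = s0.wrapping_add(s1);
        self.s0 = s0;
        s1 ^= s1 << 23;
        self.s1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
        result
    }

    /// Draw a value in `min..=max`, e.g. one tensor dimension.
    /// Plain modulo is used, so a slight bias exists for large spans.
    pub fn next_in_range(&mut self, min: usize, max: usize) -> usize {
        assert!(min <= max);
        let span = (max - min) as u64 + 1;
        min + (self.next_u64() % span) as usize
    }
}
```

Deriving a fresh state per iteration means any single test case can be regenerated from `(base_seed, iteration)` without replaying the cases before it.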
Test Case Minimization¶
When a failure is found, the minimizer (minimize function in fuzzer.rs) searches for the smallest input that still triggers the same failure. It uses binary search on:
- Dimensions: Halve each dimension independently until the failure disappears, then back off.
- Values: Narrow the value range to isolate problematic magnitudes.
The result is a minimal reproducer that is easier to debug and suitable as a regression test.
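As a rough illustration of the dimension search, the sketch below runs a binary search per dimension for the smallest size that still fails. `minimize_shape` and its `still_fails` callback are hypothetical. The sketch assumes the failure is monotonic in each dimension (if size n fails, every larger size fails too), which real kernels do not always satisfy.

```rust
/// Shrink `failing_shape` one dimension at a time, never going below `min_shape`.
/// `still_fails` re-runs the test on a candidate shape and reports whether
/// the original failure is still triggered.
pub fn minimize_shape<F>(failing_shape: &[usize], min_shape: &[usize], still_fails: F) -> Vec<usize>
where
    F: Fn(&[usize]) -> bool,
{
    assert_eq!(failing_shape.len(), min_shape.len());
    let mut shape = failing_shape.to_vec();
    for dim in 0..shape.len() {
        let mut lo = min_shape[dim]; // smallest size allowed
        let mut hi = shape[dim]; // known to fail
        while lo < hi {
            let mid = lo + (hi - lo) / 2;
            let mut candidate = shape.clone();
            candidate[dim] = mid;
            if still_fails(&candidate) {
                hi = mid; // failure persists: keep shrinking
            } else {
                lo = mid + 1; // failure disappeared: back off
            }
        }
        shape[dim] = hi;
    }
    shape
}
```

For a bug that appears whenever the first dimension is at least 5, starting from `[64, 32]` with minimums `[1, 1]` this converges to `[5, 1]`.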
Artifact Inspector¶
The artifact subsystem in artifact.rs analyzes compiled GPU code without executing it.
PtxParser¶
Extracts metrics from PTX assembly text using regex patterns:
| Metric | Regex Pattern | Example Match |
|---|---|---|
| Register count | .reg .b32 %r<N> | .reg .b32 %r<64> -> 64 registers |
| Spill stores | st.local instructions | Count of st.local occurrences |
| Spill loads | ld.local instructions | Count of ld.local occurrences |
| Shared memory | .shared .align N .bM name[SIZE] | Shared memory allocation size |
| Local memory | .local .align N .bM name[SIZE] | Local memory allocation size |
| Instruction count | All non-directive, non-label lines | Total instruction count |
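A simplified view of the first three rows of the table is sketched below. The real PtxParser is described as regex-based; this sketch substitutes plain token matching so it needs nothing beyond the standard library. `PtxMetrics` and `scan_ptx` are hypothetical names, and only `.b32` register declarations are counted.

```rust
/// Subset of the metrics listed in the table (hypothetical type).
#[derive(Debug, Default, PartialEq)]
pub struct PtxMetrics {
    pub registers: u32,
    pub spill_stores: usize,
    pub spill_loads: usize,
}

/// Scan PTX text line by line and collect register and spill counts.
pub fn scan_ptx(ptx: &str) -> PtxMetrics {
    let mut metrics = PtxMetrics::default();
    for line in ptx.lines() {
        let tokens: Vec<&str> = line.split_whitespace().collect();
        // Matches declarations such as `.reg .b32 %r<64>;`
        if tokens.len() >= 3 && tokens[0] == ".reg" && tokens[1] == ".b32" {
            if let Some(rest) = tokens[2].strip_prefix("%r<") {
                if let Some(end) = rest.find('>') {
                    if let Ok(n) = rest[..end].parse::<u32>() {
                        metrics.registers += n;
                    }
                }
            }
            continue;
        }
        // Token-level match so predicated forms like `@%p1 st.local...` count too.
        if tokens.iter().any(|t| t.starts_with("st.local")) {
            metrics.spill_stores += 1;
        }
        if tokens.iter().any(|t| t.starts_with("ld.local")) {
            metrics.spill_loads += 1;
        }
    }
    metrics
}
```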
SassParser (Optional)¶
Invokes cuobjdump --dump-sass as a subprocess to extract SASS-level metrics. Only available on Linux systems with the CUDA toolkit installed. Falls back gracefully when cuobjdump is not found.
ArtifactLinter¶
Checks extracted metrics against policy thresholds defined in [[kernels]] configuration:
[kernels.artifact_checks]
max_registers = 64
max_spills = 0
max_local_memory_bytes = 0
forbidden_instructions = ["LDG.E.SYS"]
Produces warnings or failures when thresholds are exceeded.
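The threshold check could be sketched as below. The struct and field names mirror the TOML keys but are otherwise hypothetical, `forbidden_instructions` is omitted for brevity, and treating `max_spills` as a limit on a single combined spill count is an assumption.

```rust
/// Metrics extracted from one kernel (hypothetical subset).
pub struct ArtifactMetrics {
    pub registers: u32,
    pub spills: u32,
    pub local_memory_bytes: u64,
}

/// Thresholds mirroring the `[kernels.artifact_checks]` keys.
pub struct ArtifactChecks {
    pub max_registers: u32,
    pub max_spills: u32,
    pub max_local_memory_bytes: u64,
}

/// Return one message per exceeded threshold; an empty list means the kernel passes.
pub fn lint(metrics: &ArtifactMetrics, checks: &ArtifactChecks) -> Vec<String> {
    let mut violations = Vec::new();
    if metrics.registers > checks.max_registers {
        violations.push(format!(
            "registers: {} exceeds max_registers {}",
            metrics.registers, checks.max_registers
        ));
    }
    if metrics.spills > checks.max_spills {
        violations.push(format!(
            "spills: {} exceeds max_spills {}",
            metrics.spills, checks.max_spills
        ));
    }
    if metrics.local_memory_bytes > checks.max_local_memory_bytes {
        violations.push(format!(
            "local memory: {} bytes exceeds max_local_memory_bytes {}",
            metrics.local_memory_bytes, checks.max_local_memory_bytes
        ));
    }
    violations
}
```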
ArtifactDiffer¶
Compares current artifact metrics against a named baseline. Reports regressions (metrics that increased) and improvements (metrics that decreased). Used in CI to detect performance regressions in compiled kernel code.
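The regression rule (an increase is a regression, a decrease is an improvement) is simple enough to sketch directly. `Change`, `classify`, and `diff_metrics` are hypothetical names, and passing metrics as `(name, baseline, current)` triples is a simplification of whatever ArtifactDiffer does internally.

```rust
use std::cmp::Ordering;

/// Direction of change for one metric relative to its baseline.
#[derive(Debug, PartialEq)]
pub enum Change {
    Regression,
    Improvement,
    Unchanged,
}

/// Classify one metric: higher than baseline is worse, lower is better.
pub fn classify(baseline: u64, current: u64) -> Change {
    match current.cmp(&baseline) {
        Ordering::Greater => Change::Regression,
        Ordering::Less => Change::Improvement,
        Ordering::Equal => Change::Unchanged,
    }
}

/// Report every metric that moved, given `(name, baseline, current)` triples.
pub fn diff_metrics(metrics: &[(&str, u64, u64)]) -> Vec<(String, Change)> {
    metrics
        .iter()
        .map(|&(name, baseline, current)| (name.to_string(), classify(baseline, current)))
        .filter(|(_, change)| *change != Change::Unchanged)
        .collect()
}
```

A CI job would typically fail when the returned list contains any `Regression` entry.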
Python Package Structure¶
The Python client (gpuemu) provides programmatic access to the daemon and framework-specific adapters.
gpuemu/
├── __init__.py # Package exports
├── client.py # GpuemuClient: NNG REQ socket, send/receive, protocol handling
├── validate.py # validate_op(), test case generation, result parsing
├── rng.py # Xorshift128+ and Blake2b seed derivation (mirrors Rust impl)
├── tolerances.py # get_recommended_tolerance(), calibrate_tolerance()
└── frameworks/
    ├── __init__.py
    ├── base.py # BaseAdapter: abstract interface for framework adapters
    ├── pytorch.py # PyTorchAdapter: torch.Tensor <-> TensorData conversion
    ├── jax.py # JaxAdapter: jax.Array <-> TensorData conversion
    └── tensorflow.py # TensorFlowAdapter: tf.Tensor <-> TensorData conversion
| Module | Responsibility |
|---|---|
| client.py | Manages the NNG REQ socket connection to the daemon. Handles connection, disconnection, request serialization, response deserialization, protocol version checking, and error mapping to Python exceptions (ClientError, ValidationError, ConnectionError). |
| validate.py | High-level validation API. validate_op() accepts an op name, input tensors, and the computed output, sends a ValidateOp request to the daemon, and returns a structured ValidationResult. Also provides fuzz_op() and reproduce() wrappers. |
| rng.py | Pure-Python implementation of the same xorshift128+ PRNG and Blake2b seed derivation used in Rust. Ensures that seeds are portable across languages -- a seed from a Rust fuzz run can be used in Python to generate identical inputs. |
| tolerances.py | Tolerance utilities. get_recommended_tolerance(op, dtype) returns empirically-tuned defaults. calibrate_tolerance(client, op, dtype, iterations) runs multiple validation passes and computes the minimum tolerance that passes all iterations, with a configurable safety margin. |
| frameworks/base.py | Defines BaseAdapter with abstract methods to_tensor_data() and from_tensor_data(). All framework adapters inherit from this. |
| frameworks/pytorch.py | Converts torch.Tensor to/from the TensorData dict format. Handles device transfer (GPU -> CPU), dtype mapping, stride extraction, and the validate_pytorch() context manager. |
| frameworks/jax.py | Converts jax.Array to/from TensorData. Handles JAX's functional tensor model and the validate_jax() context manager. |
| frameworks/tensorflow.py | Converts tf.Tensor to/from TensorData. Handles eager/graph mode differences and the validate_tensorflow() context manager. |
VS Code Extension¶
The VS Code extension (vscode-gpuemu) provides editor integration by invoking the gpuemu CLI as a child process and mapping results to VS Code APIs.
vscode-gpuemu/src/
├── extension.ts # Extension activation, registration of providers and commands
├── runner.ts # Spawns gpuemu CLI commands, parses JSON output
├── providers/
│ ├── diagnostics.ts # DiagnosticManager: maps validation failures to Problems panel
│ ├── codeActions.ts # Quick fixes: "Reproduce this failure", "Minimize", "Adjust tolerance"
│ ├── configValidator.ts # Validates gpuemu.toml and reports config errors
│ ├── failuresTree.ts # Tree view provider for the Failures sidebar
│ ├── statusBar.ts # Status bar item showing daemon status and last run result
│ ├── testController.ts # TestController: integrates with VS Code Testing sidebar
│ └── validationWatcher.ts # FileSystemWatcher: triggers validation on file save
└── commands/
    └── index.ts # Command palette commands (start daemon, run tests, fuzz, etc.)
Pseudo-LSP Architecture¶
The extension does not implement a full Language Server Protocol server. Instead, it uses a "pseudo-LSP" pattern:
- runner.ts spawns gpuemu CLI commands with --format json and parses the structured output.
- Providers consume the parsed results and push them to VS Code APIs (diagnostics, test items, tree views).
- Watchers trigger re-validation when relevant files change.
This design avoids maintaining a separate language server process while still providing rich editor integration.
Key Providers¶
DiagnosticManager (diagnostics.ts): Maps validation failures to VS Code Diagnostic objects in the Problems panel. Each failure becomes a diagnostic with severity (Error for failures, Warning for tolerance warnings), a message describing the failure reason, and a source location pointing to the op definition in gpuemu.toml or the source file.
TestController (testController.ts): Integrates with VS Code's built-in Testing sidebar. Each op becomes a test item with child items for each dtype. Running tests invokes gpuemu test --format json and maps results to pass/fail/skip states.
ValidationWatcher (validationWatcher.ts): Watches for saves to reference scripts and op source files. When a watched file is saved, it triggers re-validation for the affected ops and updates diagnostics. Debounces rapid saves to avoid excessive daemon calls.
Next Steps¶
- Contributing -- How to set up a development environment and submit changes.
- Architecture Overview -- Higher-level architecture for users (not just contributors).
- IPC Protocol Reference -- Full protocol specification.
- Configuration -- Config file format and all available options.