Frequently Asked Questions

Answers to the most common questions about gpuemu, organized by topic.

General

What is gpuemu?

gpuemu is a GPU-less validation toolchain for deep learning kernels. It lets you validate the correctness of GPU-targeted operations entirely on CPU, using deterministic reference implementations, numerical tolerance checking, shape/layout fuzzing, artifact linting, and CI integration.

gpuemu is not a cycle-accurate GPU emulator. It does not simulate GPU hardware or measure performance. Its purpose is to catch correctness bugs before code reaches real hardware.

Do I need a GPU to use gpuemu?

No. That is the entire point of gpuemu. All validation runs on CPU. The daemon executes reference scripts in Python, compares outputs numerically, and reports results -- no GPU drivers, no CUDA runtime, and no hardware required.

This makes gpuemu ideal for:

Local development on laptops without GPUs
CI pipelines running on CPU-only instances
Code review workflows where correctness matters more than performance

What frameworks are supported?

gpuemu provides first-class adapters for three major deep learning frameworks:

Framework	Adapter	Install Extra
PyTorch	`gpuemu.frameworks.pytorch`	`pip install gpuemu[torch]`
JAX	`gpuemu.frameworks.jax`	`pip install gpuemu[jax]`
TensorFlow	`gpuemu.frameworks.tensorflow`	`pip install gpuemu[tensorflow]`

Each adapter handles tensor conversion, dtype mapping, and framework-specific idioms so you can validate ops using native framework types.

What platforms are supported?

Platform	Status	Notes
Linux	Primary	Full workflow including artifact inspection (PTX/SASS analysis via `cuobjdump`)
macOS	Core validation	CPU validation works fully. Artifact inspection is optional and skipped if `cuobjdump` is not available.
Windows	Future	Not currently targeted. Contributions welcome.

Validation

How does validation work?

When you submit an op for validation, the following pipeline executes:

Your code computes the op output (on CPU or GPU).
The daemon spawns a CPU reference script that computes the expected output for the same inputs.
The validator compares the two outputs element-by-element using per-dtype absolute and relative tolerances.
Invariant checks verify structural properties (no NaN, no Inf, shape preserved, etc.).
The result is stored in the daemon's sled database and returned to the client.

# Run validation for all configured ops
gpuemu test

The daemon handles execution, comparison, and storage. You only need to provide the op output and a reference script.

What are tolerances?

Tolerances define the acceptable numerical difference between your op's output and the reference output. Floating-point arithmetic rounds after each operation and is not associative, so implementations that differ in hardware or reduction order (GPU vs. CPU, for example) produce slightly different results for the same inputs. Tolerances account for this.

gpuemu uses two tolerance values per dtype:

Parameter	Meaning
`atol`	Absolute tolerance -- allowed absolute difference (`|actual - expected|`), independent of magnitude
`rtol`	Relative tolerance -- allowed difference relative to the expected value (`|actual - expected| / |expected|`)

Default tolerances by dtype:

Dtype	`atol`	`rtol`
`float32`	`1e-5`	`1e-5`
`float16`	`1e-2`	`1e-2`
`bfloat16`	`1e-2`	`1e-2`
`float64`	`1e-10`	`1e-10`

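A validation passes if, for every element, the difference satisfies: |actual - expected| <= atol + rtol * |expected|. The two tolerances are combined into one bound rather than checked separately, so `atol` dominates near zero and `rtol` dominates for large magnitudes.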
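As a rough illustration of the pass condition, the check can be written in a few lines of plain Python. This is a sketch, not gpuemu's validator: `within_tolerance` is a hypothetical helper name, and it works on flat lists of floats rather than tensors. The inequality has the same form as the one NumPy documents for `numpy.isclose`.

```python
def within_tolerance(actual, expected, atol, rtol):
    """Return True if every pair satisfies |a - e| <= atol + rtol * |e|."""
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        # A NaN difference never satisfies <=, so NaN on either side fails.
        if not abs(a - e) <= atol + rtol * abs(e):
            return False
    return True


# float32 defaults from the table above: atol = rtol = 1e-5
print(within_tolerance([1.0, 2.00001], [1.0, 2.0], atol=1e-5, rtol=1e-5))  # True
print(within_tolerance([1.0, 2.001], [1.0, 2.0], atol=1e-5, rtol=1e-5))    # False
```

Note that the bound scales with the expected value only, so swapping `actual` and `expected` can change the outcome for values close to the limit.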

Can I customize tolerances?

Yes. Tolerances can be customized at multiple levels:

Per-dtype globally in gpuemu.toml:

[validation.tolerances]
float32 = { atol = 1e-5, rtol = 1e-5 }
float16 = { atol = 1e-3, rtol = 1e-3 }

Per-op in gpuemu.toml (overrides global defaults):

[[ops]]
name = "my_op"
reference = "scripts/my_op_ref.py"

[ops.tolerances]
float32 = { atol = 1e-4, rtol = 1e-4 }

Programmatically in Python using calibration:

from gpuemu.tolerances import calibrate_tolerance

# `client` is an already-connected gpuemu Python client (construction not shown here).
# Run multiple iterations and find the tightest tolerance that passes
recommended = calibrate_tolerance(client, "my_op", dtype="float32", iterations=100)
print(recommended)  # e.g. {'atol': 2.5e-06, 'rtol': 3.1e-06}
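The precedence between these levels (per-op settings override global settings, which override the built-in defaults) can be sketched as a simple layered lookup. This is an illustration only: `resolve_tolerance` and `DEFAULTS` are hypothetical names, and the sketch assumes a per-op entry replaces the whole `{atol, rtol}` pair for a dtype rather than merging field by field.

```python
# Built-in defaults, copied from the table earlier in this section.
DEFAULTS = {
    "float32": {"atol": 1e-5, "rtol": 1e-5},
    "float16": {"atol": 1e-2, "rtol": 1e-2},
    "bfloat16": {"atol": 1e-2, "rtol": 1e-2},
    "float64": {"atol": 1e-10, "rtol": 1e-10},
}


def resolve_tolerance(dtype, global_cfg=None, op_cfg=None):
    """Pick tolerances for `dtype`: per-op beats global, global beats defaults."""
    for layer in (op_cfg, global_cfg, DEFAULTS):
        if layer and dtype in layer:
            return dict(layer[dtype])  # copy so callers cannot mutate the config
    raise KeyError(f"no tolerance configured for dtype {dtype!r}")


# Mirrors the gpuemu.toml fragments above.
global_cfg = {"float16": {"atol": 1e-3, "rtol": 1e-3}}
op_cfg = {"float32": {"atol": 1e-4, "rtol": 1e-4}}
print(resolve_tolerance("float32", global_cfg, op_cfg))  # per-op value wins
print(resolve_tolerance("float64", global_cfg, op_cfg))  # falls back to defaults
```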

What are invariants?

Invariants are structural checks applied to op outputs, independent of numerical value comparison. They catch categories of bugs that tolerances alone would miss.

Invariant	What it checks
`shape_preserved`	Output tensor shape matches the reference output shape exactly
`non_negative`	All output elements are >= 0 (useful for ReLU, softmax, etc.)
`finite`	All output elements are finite (no NaN, no Inf)
`symmetric`	Output matrix is symmetric (for square matrix outputs)
`normalized`	Values sum to 1 along the last axis (for probability distributions)

Configure invariants per-op in gpuemu.toml:

[[ops]]
name = "softmax"
reference = "scripts/softmax_ref.py"
invariants = ["shape_preserved", "non_negative", "normalized"]
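To make the invariant names concrete, here is a minimal sketch of what each check could look like for a 2-D output stored as a list of rows. It is not gpuemu's implementation: `check_invariants`, `shape_of`, and the `sum_tol` threshold for `normalized` are assumptions made for illustration.

```python
import math


def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def shape_of(rows):
    # Assumes a rectangular list of rows.
    return (len(rows), len(rows[0]) if rows else 0)


def check_invariants(output, reference, names, sum_tol=1e-6):
    """Return the invariants from `names` that `output` violates, in order."""
    values = [x for row in output for x in row]
    n = len(output)
    checks = {
        "shape_preserved": lambda: shape_of(output) == shape_of(reference),
        "finite": lambda: all(math.isfinite(x) for x in values),
        "non_negative": lambda: all(x >= 0 for x in values),
        "normalized": lambda: all(abs(sum(row) - 1.0) <= sum_tol for row in output),
        "symmetric": lambda: all(len(row) == n for row in output)
        and all(output[i][j] == output[j][i] for i in range(n) for j in range(i)),
    }
    return [name for name in names if not checks[name]()]


out = [softmax([1.0, 2.0, 3.0]), softmax([0.0, 0.0, 0.0])]
ref = [[0.0] * 3, [0.0] * 3]
print(check_invariants(out, ref, ["shape_preserved", "non_negative", "normalized"]))  # []
```

A softmax output passes all three configured invariants, while an output containing NaN would fail `finite` even if a tolerance comparison were skipped.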

What is a seed?

A seed is an integer that initializes the pseudorandom number generator gpuemu uses to generate test inputs. When gpuemu creates input tensors for validation or fuzzing, it draws them from a seeded generator (xorshift128+ with Blake2b seed derivation), so the same seed always produces the same inputs.

This guarantees reproducible tests:

# First run discovers a failure at seed 98765
gpuemu fuzz --op matmul --iterations 100
# ...FAIL at seed 98765

# Reproduce the exact same failure
gpuemu test --seed 98765

The cross-language RNG implementation (Rust and Python produce identical sequences) means you can reproduce failures regardless of which client triggered them.
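The idea of a seeded, portable generator can be sketched in pure Python. The code below is an illustration under stated assumptions, not gpuemu's RNG: the way the seed is fed to Blake2b (8 little-endian bytes, 16-byte digest split into two state words) and the shift constants (23, 18, 5) are choices made for this example, so its output will not necessarily match gpuemu's sequences.

```python
import hashlib

MASK64 = (1 << 64) - 1  # emulate 64-bit wraparound with Python integers


class XorShift128Plus:
    """xorshift128+ generator whose 128-bit state is derived from a seed via Blake2b."""

    def __init__(self, seed: int):
        # Hypothetical derivation: hash the seed and split the digest into two words.
        digest = hashlib.blake2b(seed.to_bytes(8, "little"), digest_size=16).digest()
        self.s0 = int.from_bytes(digest[:8], "little")
        self.s1 = int.from_bytes(digest[8:], "little")
        if self.s0 == 0 and self.s1 == 0:
            self.s1 = 1  # the all-zero state would only ever produce zeros

    def next_u64(self) -> int:
        s1, s0 = self.s0, self.s1
        result = (s0 + s1) & MASK64
        self.s0 = s0
        s1 ^= (s1 << 23) & MASK64
        self.s1 = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5)
        return result

    def next_float(self) -> float:
        # Use the top 53 bits to build a float in [0, 1).
        return (self.next_u64() >> 11) / (1 << 53)


a = XorShift128Plus(98765)
b = XorShift128Plus(98765)
print([a.next_u64() for _ in range(3)] == [b.next_u64() for _ in range(3)])  # True
```

Because the algorithm uses only integer arithmetic with explicit 64-bit masking, the same steps can be reproduced bit for bit in another language, which is what makes cross-client reproduction possible.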

Fuzzing

What does fuzzing test?

Fuzzing automatically generates randomized test inputs to stress-test your op across a wide range of conditions. The fuzzer varies:

Shapes -- batch sizes, sequence lengths, hidden dimensions, edge cases like size-0 and size-1 dimensions
Dtypes -- all configured dtypes (float32, float16, bfloat16, float64)
Memory layouts -- contiguous, strided, transposed, and other non-contiguous tensor layouts
Value ranges -- normal values, very small values (subnormals), very large values, mixed signs

gpuemu fuzz --op matmul --iterations 100

Each iteration uses a unique deterministic seed, so any failure can be reproduced exactly.
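A simplified picture of seed-driven case generation for a matmul-style op is sketched below. Everything here is hypothetical: `make_case`, `fuzz_cases`, the 10% edge-case rate, the dimension range, and deriving per-iteration seeds as `base_seed + i` are assumptions for illustration, and Python's `random.Random` stands in for gpuemu's own generator.

```python
import random

DTYPES = ["float32", "float16", "bfloat16", "float64"]
LAYOUTS = ["contiguous", "strided", "transposed"]
EDGE_DIMS = [0, 1]  # size-0 and size-1 dimensions


def make_case(seed):
    """Build one test-case description that depends only on `seed`."""
    rng = random.Random(seed)

    def dim():
        # Occasionally pick an edge-case dimension, otherwise an ordinary size.
        return rng.choice(EDGE_DIMS) if rng.random() < 0.1 else rng.randint(2, 256)

    m, k, n = dim(), dim(), dim()
    return {
        "seed": seed,
        "shapes": [(m, k), (k, n)],  # inner dimensions agree, as matmul requires
        "dtype": rng.choice(DTYPES),
        "layout": rng.choice(LAYOUTS),
    }


def fuzz_cases(base_seed, iterations):
    """Yield one case per iteration, each with its own seed."""
    for i in range(iterations):
        yield make_case(base_seed + i)


# Regenerating a case from its seed gives an identical description.
print(make_case(98765) == make_case(98765))  # True
```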

How many fuzzing iterations should I run?

It depends on your goals:

Scenario	Recommended Iterations
Quick sanity check during development	50--100
Pre-commit or pull request CI gate	100--500
Thorough nightly testing	1,000+
Initial validation of a new op	5,000--10,000

More iterations cover more of the input space but take longer. A good strategy is to run a small number on every commit and a large number on a nightly schedule:

# Fast check (CI on every push)
gpuemu fuzz --op matmul --iterations 100

# Thorough check (nightly)
gpuemu fuzz --op matmul --iterations 5000

What is test case minimization?

When fuzzing discovers a failure, the failing input may be large and complex (e.g., a 128x256 matrix). Test case minimization automatically searches for the smallest input that still triggers the same failure.

The minimizer uses binary search on tensor dimensions and values to shrink the reproducer:

# Minimize a failure found at seed 98765
gpuemu minimize --op matmul --seed 98765

Example output:

Original: shape (128, 256) x (256, 64) -- FAIL (max diff: 2.3e-4)
Minimized: shape (2, 3) x (3, 2) -- FAIL (max diff: 1.8e-4)
Minimal reproducer saved. Seed: 98765, shapes: [(2, 3), (3, 2)]

A smaller reproducer is easier to debug and makes a better regression test.
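The dimension-shrinking part of minimization can be sketched as a per-axis binary search. This is a simplified illustration, not gpuemu's minimizer: `minimize_shape` and `smallest_failing` are hypothetical names, the sketch shrinks dimensions only (not values), and it assumes failures are monotonic in size, meaning that if a size fails, every larger size up to the original also fails.

```python
def smallest_failing(hi, still_fails):
    """Binary-search the smallest size in [1, hi] for which still_fails(size) is True.

    Assumes still_fails(hi) is True and that failure is monotonic in size.
    """
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if still_fails(mid):
            hi = mid        # mid still fails, so the answer is mid or smaller
        else:
            lo = mid + 1    # mid passes, so the answer is larger than mid
    return hi


def minimize_shape(shape, fails):
    """Shrink each dimension in turn while `fails(shape)` keeps returning True."""
    shape = list(shape)
    for axis in range(len(shape)):
        def still_fails(size, axis=axis):
            trial = shape.copy()
            trial[axis] = size
            return fails(tuple(trial))

        shape[axis] = smallest_failing(shape[axis], still_fails)
    return tuple(shape)


# Toy failure predicate over (M, K, N): the bug needs M >= 2, K >= 3, N >= 2.
def toy_fails(s):
    return s[0] >= 2 and s[1] >= 3 and s[2] >= 2


print(minimize_shape((128, 256, 64), toy_fails))  # (2, 3, 2)
```

In practice the predicate would rerun the op and the reference for each trial shape, so the logarithmic number of probes per axis keeps minimization affordable.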

Architecture

What is the daemon?

The gpuemu daemon (gpuemu-daemon) is a long-running background Rust process that serves as the validation engine. It handles:

IPC -- Listens on a Unix domain socket (~/.gpuemu/gpuemu.sock) using the NNG REQ/REP protocol
Execution -- Spawns Python reference scripts as subprocesses to compute expected outputs
Validation -- Compares op outputs against reference outputs with per-dtype tolerances
Fuzzing -- Generates randomized test cases with deterministic seeds
Storage -- Persists results, failures, baselines, and artifact metrics in a sled embedded database
Artifact analysis -- Parses PTX/SASS output and lints against configurable policies

# Start the daemon
gpuemu daemon start --background

# Check status
gpuemu daemon status

# Stop it
gpuemu daemon stop

The CLI, Python client, and VS Code extension all communicate with the daemon over IPC. They do not perform validation themselves.

Where is data stored?

All gpuemu runtime data is stored under the ~/.gpuemu/ directory:

~/.gpuemu/
├── gpuemu.sock       # Unix domain socket (IPC endpoint)
├── bin/              # CLI binary (if installed here)
├── db/               # sled embedded database
│   ├── results/      # Validation results
│   ├── failures/     # Recorded failures
│   ├── baselines/    # Named baseline snapshots
│   └── artifacts/    # Artifact metrics and baselines
└── logs/             # Daemon log files

Records in the sled database are serialized with rkyv, which supports zero-copy deserialization, so lookups stay fast even with thousands of stored results. Data persists across daemon restarts.

What is a reference script?

A reference script is a standalone Python program that provides the canonical CPU implementation of an operation. It follows a strict protocol:

Read JSON with base64-encoded tensor data from stdin
Compute the expected output using standard CPU libraries (NumPy, etc.)
Write JSON with base64-encoded result tensors to stdout

scripts/matmul_ref.py

import base64
import json
import sys

import numpy as np

def decode_tensor(encoded):
    # Rebuild an array from base64-encoded raw bytes plus dtype and shape metadata.
    data = base64.b64decode(encoded["data"])
    return np.frombuffer(data, dtype=np.dtype(encoded["dtype"])).reshape(encoded["shape"])

def encode_tensor(arr):
    # Serialize an array as base64-encoded raw bytes plus dtype and shape metadata.
    return {
        "data": base64.b64encode(arr.tobytes()).decode("ascii"),
        "dtype": str(arr.dtype),
        "shape": list(arr.shape),
    }

# Read the request from stdin, compute the reference result, write the response to stdout.
request = json.loads(sys.stdin.read())
a = decode_tensor(request["inputs"]["a"])
b = decode_tensor(request["inputs"]["b"])
result = np.matmul(a, b)
json.dump({"outputs": {"result": encode_tensor(result)}}, sys.stdout)

Reference scripts must be deterministic, side-effect-free, and must not write anything to stdout except the JSON response.
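When writing a new reference script, it can help to exercise the stdin/stdout protocol by hand before wiring it into gpuemu. The sketch below is a small stand-alone driver, not part of gpuemu: `run_reference`, `encode_f32`, and `decode_f32` are hypothetical helpers, and the encoding assumes little-endian `float32` data laid out as in the matmul example above.

```python
import base64
import json
import struct
import subprocess
import sys


def encode_f32(values, shape):
    """Encode a flat list of floats as a base64 float32 tensor description."""
    raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
    return {
        "data": base64.b64encode(raw).decode("ascii"),
        "dtype": "float32",
        "shape": list(shape),
    }


def decode_f32(encoded):
    """Decode a base64 float32 tensor description back into a flat list."""
    raw = base64.b64decode(encoded["data"])
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))


def run_reference(script_path, inputs, timeout=30.0):
    """Run a reference script with a JSON request on stdin; return its outputs."""
    request = json.dumps({"inputs": inputs})
    proc = subprocess.run(
        [sys.executable, script_path],
        input=request,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the script exits with a non-zero status
    )
    # Any stray print() in the script would make this parse fail, which is
    # why reference scripts must keep stdout clean.
    return json.loads(proc.stdout)["outputs"]
```

For example, `run_reference("scripts/matmul_ref.py", {"a": encode_f32([1, 2, 3, 4], (2, 2)), "b": encode_f32([1, 0, 0, 1], (2, 2))})` would return a dictionary whose `"result"` entry can be passed to `decode_f32`, provided NumPy is installed for the script.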

CI

Can I use gpuemu in CI without a GPU?

Yes. Running in CI without a GPU is a core design goal of gpuemu. The daemon, CLI, and Python client all run on CPU-only machines. A typical CI workflow:

.github/workflows/gpuemu.yml

jobs:
  validate:
    runs-on: ubuntu-latest  # No GPU needed
    steps:
      - uses: actions/checkout@v4
      - name: Install gpuemu
        run: |
          cargo install gpuemu
          pip install gpuemu
      - name: Start daemon
        run: gpuemu daemon start --background
      - name: Run tests
        run: gpuemu test --format junit > results.xml
      - name: Upload results
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: results.xml

No GPU runners, no CUDA installation, no special hardware. Standard CI infrastructure is sufficient.

What output formats does CI support?

gpuemu supports three output formats for CI integration:

Format	Flag	Use Case
Text	`--format text` (default)	Human-readable console output
JSON	`--format json`	Machine-readable structured output for custom tooling
JUnit XML	`--format junit`	Standard test report format consumed by CI platforms (GitHub Actions, GitLab CI, Jenkins, etc.)

# Human-readable output
gpuemu test --format text

# JSON for scripting
gpuemu test --format json > results.json

# JUnit XML for CI platforms
gpuemu test --format junit > results.xml

The JSON format includes all validation details (seed, max diff, tolerances used, failure reasons) and is suitable for building dashboards or custom reporting.
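For custom reporting on top of the JUnit output, a short script can summarize the report. The sketch below assumes the file follows the common JUnit XML convention of `testcase` elements with optional `failure` or `error` children; the exact attributes gpuemu emits are not specified here, and `summarize_junit` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text):
    """Count test cases and collect (name, message) for each failed one."""
    root = ET.fromstring(xml_text)
    total = 0
    failures = []
    # iter() walks the whole tree, so both <testsuites> and <testsuite> roots work.
    for case in root.iter("testcase"):
        total += 1
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        if problem is not None:
            failures.append((case.get("name"), problem.get("message")))
    return {"total": total, "failed": len(failures), "failures": failures}
```

A CI step could call `summarize_junit(open("results.xml").read())` after `gpuemu test --format junit > results.xml` and post the failing names to a chat channel or dashboard.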

Next Steps

Common Issues -- Solutions to specific error messages and problems.
Configuration -- Full reference for gpuemu.toml settings.
Architecture -- Understand how the components fit together.
CI Integration Tutorial -- Step-by-step CI setup guide.