Architecture¶
Design Philosophy¶
numaperf is built on three core principles:
- **Locality by default** - APIs guide you toward NUMA-aware patterns
- **Transparency** - Know what enforcement you actually got
- **Graceful degradation** - Works on any system, optimizes when possible
Crate Organization¶
numaperf is organized as a workspace of specialized crates:
```
numaperf (facade)
│
├── numaperf-core      # Shared types and errors
│
├── numaperf-topo      # Topology discovery
│   └── depends on: core
│
├── numaperf-affinity  # Thread pinning
│   └── depends on: core
│
├── numaperf-mem       # Memory placement
│   └── depends on: core
│
├── numaperf-sched     # Work scheduling
│   └── depends on: core, topo, affinity
│
├── numaperf-sharded   # Sharded data structures
│   └── depends on: core, topo
│
├── numaperf-io        # Device locality
│   └── depends on: core, topo
│
└── numaperf-perf      # Observability
    └── depends on: core, topo, sharded
```
Crate Responsibilities¶
| Crate | Responsibility |
|---|---|
| `numaperf-core` | `NodeId`, `CpuSet`, `NodeMask`, `NumaError`, `HardMode`, `Capabilities` |
| `numaperf-topo` | `Topology`, `NumaNode`, discovery from `/sys` |
| `numaperf-affinity` | `ScopedPin`, `get_affinity()`, `set_affinity()` |
| `numaperf-mem` | `NumaRegion`, `MemPolicy`, `mbind()` wrapper |
| `numaperf-sched` | `NumaExecutor`, per-node worker pools, work stealing |
| `numaperf-sharded` | `NumaSharded<T>`, `ShardedCounter`, `CachePadded<T>` |
| `numaperf-io` | `DeviceMap`, device-to-node mapping |
| `numaperf-perf` | `StatsCollector`, `LocalityReport`, metrics |
Key Patterns¶
`Arc<Topology>`¶
Topology discovery is expensive. Create once, share everywhere:
```rust
use numaperf::Topology;
use std::sync::Arc;

// Create once at startup
let topo = Arc::new(Topology::discover()?);

// Share across threads
let topo_clone = Arc::clone(&topo);
std::thread::spawn(move || {
    // Use topo_clone
});
```
RAII Guards¶
Resources are managed with RAII patterns:
```rust
use numaperf::{ScopedPin, NumaRegion};

{
    // Pin is active
    let _pin = ScopedPin::pin_current(cpus)?;
    // ...
} // Pin automatically restored

{
    // Memory is mapped
    let region = NumaRegion::anon(...)?;
    // ...
} // Memory automatically unmapped
```
Builder Pattern¶
Complex types use builders:
```rust
use numaperf::{NumaExecutor, StealPolicy, HardMode};

let exec = NumaExecutor::builder(topo)
    .steal_policy(StealPolicy::LocalThenSocketThenRemote)
    .workers_per_node(4)
    .hard_mode(HardMode::Strict)
    .build()?;
```
Enforcement Transparency¶
Operations report what enforcement they achieved:
```rust
use numaperf::{NumaRegion, EnforcementLevel};

let region = NumaRegion::anon(...)?;

match region.enforcement() {
    EnforcementLevel::Strict => println!("Guaranteed placement"),
    EnforcementLevel::BestEffort { reason } => println!("Best effort: {}", reason),
    EnforcementLevel::None { reason } => println!("No enforcement: {}", reason),
}
```
Thread Safety¶
| Type | `Send` | `Sync` | Notes |
|---|---|---|---|
| `Topology` | Yes | Yes | Immutable after creation |
| `NumaNode` | Yes | Yes | Immutable |
| `ScopedPin` | No | No | Thread-local by design |
| `NumaRegion` | Yes | Yes | Memory can be shared |
| `NumaExecutor` | Yes | Yes | Submit from any thread |
| `NumaSharded<T>` | If `T: Send` | If `T: Sync` | Depends on `T` |
| `StatsCollector` | Yes | Yes | Lock-free internals |
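The table lists `CachePadded<T>` among the sharded types; its purpose is to keep hot values on separate cache lines so per-node counters do not false-share. As a rough illustration of the idea (not the crate's actual definition), an over-aligned wrapper is enough:

```rust
// Illustrative sketch: an over-aligned wrapper in the spirit of
// CachePadded<T>. 64 bytes is a common cache-line size on x86_64;
// the real type may choose the padding per architecture.
#[repr(align(64))]
pub struct CachePadded<T>(pub T);

// Adjacent elements of a [CachePadded<u64>; N] array start on distinct
// cache lines, so concurrent writers do not invalidate each other's lines.
```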
ScopedPin is !Send¶
ScopedPin intentionally cannot be sent between threads:
```rust
let pin = ScopedPin::pin_current(cpus)?;

// This won't compile - and that's correct!
std::thread::spawn(move || {
    drop(pin); // Would restore the wrong thread's affinity
});
```
Data Flow¶
Typical Application Flow¶
```
1. Startup
   ├── Capabilities::detect()     ─► Check system support
   └── Topology::discover()       ─► Learn NUMA layout

2. Initialization
   ├── NumaExecutor::builder()    ─► Create worker pools
   ├── NumaSharded::new()         ─► Per-node data
   └── StatsCollector::new()      ─► Metrics collection

3. Runtime
   ├── exec.submit_to_node()      ─► Submit work
   ├── sharded.local()            ─► Access local data
   └── collector.record_*()       ─► Track locality

4. Shutdown
   ├── exec.shutdown()            ─► Wait for completion
   └── LocalityReport::generate() ─► Analyze results
```
Memory Allocation Flow¶
```
NumaRegion::anon(size, policy, huge_pages, prefault)
│
├── mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0)
│
├── mbind(addr, size, policy, nodemask, maxnode, flags)
│   │
│   ├── Success ─► EnforcementLevel::Strict
│   │
│   └── EPERM   ─► Soft mode: EnforcementLevel::BestEffort
│                  Hard mode: NumaError::HardModeUnavailable
│
└── prefault (if requested)
    └── Touch each page to force allocation
```
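The `EPERM` branch above is the crux of soft vs hard mode. A minimal sketch of that decision, using illustrative stand-ins for the crate's types (only the fallback logic is shown, not the actual `mbind` call):

```rust
// Sketch of the documented mbind fallback. `Mode`, `Placement`, and the
// EPERM constant are illustrative stand-ins, not the crate's API.
const EPERM: i32 = 1;

#[derive(Debug, PartialEq)]
enum Mode {
    Soft,
    Hard,
}

#[derive(Debug, PartialEq)]
enum Placement {
    Strict,
    BestEffort { reason: String },
    HardModeUnavailable { reason: String },
}

fn classify_mbind(result: Result<(), i32>, mode: Mode) -> Placement {
    match (result, mode) {
        // mbind succeeded: placement is guaranteed.
        (Ok(()), _) => Placement::Strict,
        // Permission denied in soft mode: degrade, but report why.
        (Err(EPERM), Mode::Soft) => Placement::BestEffort {
            reason: "mbind denied (EPERM)".into(),
        },
        // Any failure in hard mode (and other errors, in this sketch) is fatal.
        (Err(errno), _) => Placement::HardModeUnavailable {
            reason: format!("mbind failed (errno {errno})"),
        },
    }
}
```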
Work Scheduling Flow¶
```
exec.submit_to_node(node_id, closure)
│
├── Find queue for target node
│
└── Push to node's work queue
    │
    └── Worker on that node picks it up
        │
        ├── Execute closure
        │
        └── If queue empty, try stealing
            │
            ├── LocalOnly: Never steal
            │
            ├── LocalThenSocketThenRemote:
            │     1. Try same-socket nodes
            │     2. Try remote nodes
            │
            └── Any: Steal from any node
```
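The three stealing strategies reduce to an ordering over candidate victim nodes. A small sketch of that ordering (the `StealPolicy` names come from the builder example earlier; `same_socket` is an assumed helper standing in for real topology data):

```rust
// Sketch of victim ordering under each StealPolicy variant.
// `same_socket` is an illustrative stand-in for real topology queries.
#[derive(Clone, Copy)]
enum StealPolicy {
    LocalOnly,
    LocalThenSocketThenRemote,
    Any,
}

fn steal_order(
    policy: StealPolicy,
    me: usize,
    nodes: &[usize],
    same_socket: &dyn Fn(usize, usize) -> bool,
) -> Vec<usize> {
    let others = || nodes.iter().copied().filter(|&n| n != me);
    match policy {
        // Never steal: local queue only.
        StealPolicy::LocalOnly => Vec::new(),
        // Same-socket victims first, then remote nodes.
        StealPolicy::LocalThenSocketThenRemote => {
            let mut order: Vec<usize> =
                others().filter(|&n| same_socket(me, n)).collect();
            order.extend(others().filter(|&n| !same_socket(me, n)));
            order
        }
        // Any node is a fair target.
        StealPolicy::Any => others().collect(),
    }
}
```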
Error Handling¶
All fallible operations return `Result<T, NumaError>`:

```rust
pub enum NumaError {
    // System errors
    IoError(std::io::Error),

    // Configuration errors
    InvalidNodeId(u32),
    InvalidCpuId(u32),
    EmptyCpuSet,
    EmptyNodeMask,

    // Capability errors
    NotSupported(String),
    HardModeUnavailable { operation: String, reason: String },

    // Runtime errors
    TopologyMismatch,
    WorkerPanic,
}
```
Errors include context for debugging:
```rust
match result {
    Err(NumaError::HardModeUnavailable { operation, reason }) => {
        eprintln!("Cannot enforce {} in hard mode: {}", operation, reason);
    }
    // ...
}
```
Platform Abstraction¶
Linux-specific code is isolated:
```
numaperf-topo/src/
├── lib.rs
├── topology.rs        # Platform-agnostic API
├── node.rs
└── discovery/
    ├── mod.rs         # Platform selection
    ├── linux.rs       # Linux: reads /sys/devices/system/node/
    └── fallback.rs    # Other: single synthetic node
```
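A `mod.rs` like this is typically a pair of `cfg`-gated modules re-exporting one `discover` function. A compressed sketch of the pattern (inline modules for brevity, and the Linux arm stubbed so the example stays self-contained; the real crate enumerates `/sys/devices/system/node/`):

```rust
// Sketch of cfg-based platform selection. `RawNode` and the stubbed
// module bodies are illustrative, not the crate's actual code.
#[derive(Debug)]
pub struct RawNode {
    pub id: u32,
}

#[cfg(target_os = "linux")]
mod linux {
    use super::RawNode;
    pub fn discover() -> Vec<RawNode> {
        // Real code would enumerate /sys/devices/system/node/node*;
        // stubbed to a single node to keep the sketch runnable.
        vec![RawNode { id: 0 }]
    }
}

#[cfg(not(target_os = "linux"))]
mod fallback {
    use super::RawNode;
    pub fn discover() -> Vec<RawNode> {
        // Other platforms: one synthetic node covering all CPUs.
        vec![RawNode { id: 0 }]
    }
}

// Exactly one of these is compiled in, so callers see a single API.
#[cfg(target_os = "linux")]
pub use linux::discover;
#[cfg(not(target_os = "linux"))]
pub use fallback::discover;
```

Because both arms present the same signature, callers and tests never need to know which platform they are on.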
This allows:
- Full functionality on Linux
- Graceful degradation elsewhere
- Easy testing with synthetic topologies
Next Steps¶
- Soft vs Hard Mode - Enforcement modes explained
- Memory Policies - Memory placement in detail
- API Overview - Complete API reference