Architecture Overview¶
ZViz is a container isolation runtime that achieves gVisor-equivalent security outcomes through layered kernel enforcement rather than syscall emulation.
Design Philosophy¶
The Problem¶
You need to run code you can't trust: AI agents, CI/CD pipelines, third-party plugins, multi-tenant workloads. Traditional containers (runc) share the kernel attack surface with the host. gVisor solves this with a userspace kernel, but at significant performance cost.
The Solution¶
ZViz reaches the same security outcomes without emulating Linux by:
- Composing kernel primitives - Namespaces, seccomp, LSMs, cgroups
- Brokering only when necessary - Syscalls needing argument inspection
- Profile-driven enforcement - Workload-specific policies
Trade-offs¶
| Aspect | gVisor | ZViz |
|---|---|---|
| Kernel exposure | Minimal (sentry) | Controlled (filtered) |
| Syscall overhead | High (all syscalls) | Low (brokered only) |
| Compatibility | Limited | Native for allowed syscalls |
| Memory overhead | ~50MB | ~2MB |
| Cold start | ~200ms | ~8ms |
| Network performance | Emulated stack | Native stack |
| Nested containers | Emulated (works) | Blocked (use gVisor) |
System Architecture¶
┌─────────────────────────────────────────────────────────────────────┐
│ Container │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Application │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ syscall │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Seccomp-BPF Filter │ │
│ │ ┌──────────────────┬────────────────┬──────────────────┐ │ │
│ │ │ ALLOW │ DENY │ USER_NOTIF │ │ │
│ │ │ (native) │ (EPERM) │ (mediated) │ │ │
│ │ │ 90 syscalls │ 22 syscalls │ 5 syscalls │ │ │
│ │ └──────────────────┴────────────────┴────────┬─────────┘ │ │
│ └─────────────────────────────────────────────────│─────────────┘ │
│ │ │
│ ┌───────────────────────────────────────────────│───────────────┐ │
│ │ Containment Layer │ │ │
│ │ • User namespace (UID/GID mapping) │ │ │
│ │ • PID namespace (process isolation) │ │ │
│ │ • Mount namespace (filesystem isolation) │ │ │
│ │ • Network namespace (network isolation) │ │ │
│ │ • All 41 capabilities dropped │ │ │
│ └───────────────────────────────────────────────│───────────────┘ │
└──────────────────────────────────────────────────│──────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ ZViz Broker │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Syscall Mediators │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ openat │ │ ioctl │ │ socket │ │ clone │ │ execve │ │ │
│ │ │ Path │ │ Command │ │ Domain │ │ Flag │ │ Path │ │ │
│ │ │ check │ │ filter │ │ filter │ │ validate│ │ check │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │ │
│ │ allow/deny/fd │
│ ▼ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Host Kernel │ │
│ │ • Landlock LSM (filesystem access control) │ │
│ │ • cgroups v2 (resource limits) │ │
│ │ • nftables (network policy) │ │
│ └───────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Enforcement Layers¶
ZViz applies five enforcement layers in order:
Layer 1: Namespaces¶
Isolation using Linux namespaces:
- User namespace - Maps container UID 0 to unprivileged host UID
- PID namespace - Isolates process tree
- Mount namespace - Isolated filesystem view
- Network namespace - Isolated network stack
- IPC namespace - Isolated System V IPC
- UTS namespace - Isolated hostname
Layer 2: Capabilities¶
All 41 Linux capabilities are dropped via prctl(PR_CAPBSET_DROP). Even if the container runs as "root", it has no privileged capabilities.
Layer 3: Landlock LSM¶
Filesystem access control:
- Read-only rootfs
- Writable only:
/tmp,/work - No access to host paths
Layer 4: Seccomp-BPF¶
124-instruction BPF filter classifying syscalls:
| Action | Use Case | Performance |
|---|---|---|
ALLOW |
Safe syscalls (read, write, mmap) | Native kernel speed |
DENY |
Dangerous syscalls (ptrace, mount, bpf) | Immediate EPERM |
USER_NOTIF |
Need inspection (openat, socket, execve) | Broker overhead |
Layer 5: cgroups v2¶
Resource limits via cgroups v2 controllers:
- memory - Memory + swap limits
- cpu - CPU quota/shares
- pids - Process count limits
- io - I/O bandwidth limits
The Broker¶
The broker is the central policy decision point for mediated syscalls:
┌─────────────────────────────────────────┐
│ ZViz Broker │
│ │
│ ┌────────────────────────────────────┐ │
│ │ Notification Listener │ │
│ │ • SECCOMP_RET_USER_NOTIF │ │
│ │ • epoll-based event loop │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────┐ │
│ │ Mediators │ │
│ │ │ │
│ │ openat: Path traversal check │ │
│ │ ioctl: Command allowlist │ │
│ │ socket: Domain/type filter │ │
│ │ clone: Flag validation │ │
│ │ execve: Path check │ │
│ │ prctl: Operation filter │ │
│ └────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────┐ │
│ │ Response Handler │ │
│ │ • SECCOMP_ADDFD for fds │ │
│ │ • Error injection │ │
│ └────────────────────────────────────┘ │
└─────────────────────────────────────────┘
Profile System¶
Profiles are workload-specific security policies:
| Profile | Description |
|---|---|
ci-runner |
CI/CD workloads (default) |
web-server |
HTTP servers, APIs |
batch-job |
Data processing (no network) |
hostile-tenant |
Maximum security |
development |
Debugging (allows ptrace) |
Profiles compile to enforcement artifacts:
Performance Characteristics¶
Overhead Model¶
| Syscall Type | Overhead |
|---|---|
| Allowed (fast path) | ~0 |
| Denied | ~0 |
| Brokered (allow) | ~50-100μs |
| Brokered (fd return) | ~100-200μs |
Why ZViz is Fast¶
- Minimal brokered set - Only 5 syscalls go through the broker
- BPF fast path - 90 common syscalls execute at native speed
- No emulation - Allowed syscalls hit the kernel directly
When to Use gVisor Instead¶
ZViz blocks dangerous syscalls with EPERM. gVisor emulates them safely. For some workloads, emulation is required:
- Docker-in-Docker - Needs
mount,unshare - Bazel / Nix builds - Internal sandboxing creates namespaces
- strace / debuggers - Needs
ptrace
For these workloads, use gVisor. For everything else, ZViz is faster and stricter.