Enforcement Model¶
ZViz applies five layers of kernel-level enforcement before execve-ing the
workload. The order matters: each layer's setup must complete while the
previous layer still permits it.
The five layers, in application order¶
- Namespaces (
user,pid,mount,ipc,uts;netoptional). The parent forks; the child doesunshare(CLONE_NEWUSER | ...)before the parent writes the uid/gid maps and signals the child to proceed. BecauseCLONE_NEWPIDputs only future children into the new pid namespace, the child then forks once more so the workload itself becomes pid 1 of that namespace. - Capability drop via
PR_CAPBSET_DROPfor every capability in the process's bounding set, thenPR_SET_NO_NEW_PRIVSso the drop is irrevocable acrossexecve. The default profile drops all 41 capabilities;CAP_NET_RAW,CAP_SYS_ADMIN, etc. are unreachable for the workload. - Landlock (stackable LSM, Linux ≥ 5.13). Build a ruleset with
LANDLOCK_ACCESS_FS_READ_WRITEon the container's rootfs (orREAD_EXECif the OCIroot.readonly: truewas set) andREAD_WRITEon/tmp, then commit withlandlock_restrict_self. The restriction is irrevocable for the calling task and its descendants. - seccomp-BPF. Load the compiled filter (a single classic-BPF program of ~168 instructions: 130 allow checks, 24 deny checks, the inline socket-domain filter, plus the action returns). The kernel evaluates the program on every syscall.
- cgroups v2. The parent attached the child to a per-container cgroup
before the child reached this stage; the cgroup limits memory and the
number of processes (default
pids.max=512,memory.max=2G).
Filesystem setup inside the mount namespace¶
In addition to the security layers, the executor sets up the container's
filesystem inside its mount namespace, before pivot_root:
- Bind-mounts the rootfs onto itself (a prerequisite for
pivot_root). - Mounts a private
tmpfsat<rootfs>/dev(mode 755) and bind-mounts the host's/dev/null,/dev/zero,/dev/full,/dev/random,/dev/urandom,/dev/ttyonto empty target files inside it. Afterpivot_root, it adds the standard/dev/std{in,out,err}and/dev/fdsymlinks to/proc/self/fd. - Mounts
procfsat<rootfs>/proc(nosuid,nodev,noexec) andsysfsat<rootfs>/sys(nosuid,nodev,noexec,ro). Both mounts have to happen pre-pivot because an unprivileged user namespace's procfs check requires ancestor-procfs visibility, whichpivot_rootremoves. - Applies the OCI
mounts[]array:bind/rbindwith the Linux-specificMS_BINDfollowed by anMS_REMOUNT|MS_RDONLYwhenrois requested (the kernel ignoresMS_RDONLYon the initial bind syscall; the remount is mandatory).tmpfs/proc/sysfstyped mounts go through one syscall each. Unknown options are logged and skipped.
Filter size¶
The current default container profile produces a 168-instruction filter: 130 allow + 24 deny + 5 socket-domain check + 3 load + 4 action returns + 2 architecture check. That fits comfortably under the kernel's 4096-instruction BPF limit.
Multi-architecture support¶
The seccomp filter generator is architecture-dispatched via an Arch enum
that emits the matching AUDIT_ARCH_* constant at filter-build time. The
project cross-compiles cleanly for aarch64-linux-musl. The syscall number
tables in src/schema/schema.zig are currently x86_64-only; populating them
for aarch64 (which follows the asm-generic numbering) is the remaining work
for a true aarch64 deployment.
Why this ordering¶
- Landlock must precede seccomp so the
landlock_*syscalls themselves are not blocked. - Capabilities drop before seccomp so a compromised setup path cannot re-acquire privilege.
- Cgroup attachment happens in the parent before the child reaches its policy-application phase, so memory pressure during setup is bounded.