
Kernel Configuration

Linux kernel features and settings for optimal numaperf operation.

Quick Compatibility Check

# Check if NUMA is enabled
numactl --hardware

# Check kernel version (4.5+ recommended)
uname -r

# Check available capabilities
capsh --print

# Verify NUMA sysfs is present
ls /sys/devices/system/node/

Kernel Version Requirements

Feature | Minimum Kernel | Notes
Basic NUMA topology | 2.6.18 | /sys/devices/system/node/
mbind() / set_mempolicy() | 2.6.7 | Memory policy syscalls
CPU affinity | 2.5.8 | sched_setaffinity()
Transparent huge pages | 2.6.38 | /sys/kernel/mm/transparent_hugepage/
move_pages() | 2.6.18 | Page migration queries
MPOL_PREFERRED_MANY | 5.15 | Preferred policy with multiple nodes
perf_event_open | 2.6.31 | Hardware performance counters

Recommended: Linux 5.4 or newer for full functionality (5.15+ if you need MPOL_PREFERRED_MANY).
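The version check can be automated with a small shell snippet (a sketch; the 5.4 threshold matches the recommendation above, and the parsing assumes `uname -r` begins with "major.minor"):

```shell
# Compare the running kernel against a minimum "major.minor" version.
kernel_at_least() {
    want_major=${1%%.*}
    want_minor=${1#*.}
    rel=$(uname -r)                  # e.g. "5.15.0-91-generic"
    have_major=${rel%%.*}
    rest=${rel#*.}
    have_minor=${rest%%[!0-9]*}      # strip everything after the minor number
    [ "$have_major" -gt "$want_major" ] ||
        { [ "$have_major" -eq "$want_major" ] && [ "$have_minor" -ge "$want_minor" ]; }
}

if kernel_at_least 5.4; then
    echo "kernel OK for full functionality"
else
    echo "kernel older than 5.4: some features unavailable"
fi
```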

Capability Requirements

Capability | Required For | Soft Mode Behavior
None (unprivileged) | Topology discovery, MPOL_PREFERRED, thread affinity | Works fully
CAP_SYS_NICE | Strict CPU pinning, real-time priorities | Falls back to best-effort affinity
CAP_IPC_LOCK | mlock() for pinned pages | Skips memory locking
CAP_SYS_ADMIN | MPOL_BIND strict enforcement, perf counters | Falls back to preferred policy

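Whether the current process actually holds these capabilities can be read from its effective capability mask (a sketch; the bit numbers are the standard values from <linux/capability.h>):

```shell
# Decode selected capabilities from this process's effective set.
# Bit numbers per <linux/capability.h>:
# CAP_IPC_LOCK=14, CAP_SYS_ADMIN=21, CAP_SYS_NICE=23.
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status)

has_cap() {  # has_cap <bit-number>
    [ $(( (0x$cap_eff >> $1) & 1 )) -eq 1 ]
}

for entry in "CAP_IPC_LOCK 14" "CAP_SYS_ADMIN 21" "CAP_SYS_NICE 23"; do
    set -- $entry
    if has_cap "$2"; then
        echo "$1: present"
    else
        echo "$1: absent (soft-mode fallback will be used)"
    fi
done
```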
Granting Capabilities

# Grant capabilities to a binary
sudo setcap 'cap_sys_nice,cap_ipc_lock,cap_sys_admin+ep' ./your_binary

# Verify capabilities
getcap ./your_binary

# Run with specific capabilities
sudo capsh --caps='cap_sys_nice+eip' -- -c './your_binary'

Unprivileged Features

These work without elevated privileges:

  • Topology discovery: Full access to /sys/devices/system/node/
  • CPU affinity: Setting affinity to any CPU the process can access
  • MPOL_PREFERRED: Preferred node hint (kernel may ignore under pressure)
  • MPOL_LOCAL: Use current thread's node
  • MPOL_INTERLEAVE: Round-robin across nodes
  • Huge pages: Transparent huge pages (if enabled system-wide)
  • Statistics: Reading /proc/self/numa_maps
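For example, the statistics entry above can be exercised without any privileges: `/proc/self/numa_maps` reports per-node page counts as `N<node>=<pages>` tokens, which a short pipeline can summarize (a sketch; on kernels without NUMA support the file may be absent):

```shell
# Summarize this process's per-node page placement, unprivileged.
if [ -r /proc/self/numa_maps ]; then
    tr ' ' '\n' < /proc/self/numa_maps |
        awk -F'[N=]' '/^N[0-9]+=/ {pages[$2] += $3}
                      END {for (n in pages) printf "node %s: %d pages\n", n, pages[n]}'
else
    echo "no numa_maps: kernel lacks NUMA support"
fi
```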

Privileged Features

Feature | Requirement | Alternative
MPOL_BIND strict | CAP_SYS_ADMIN or root | Use MPOL_PREFERRED
mlock() / mlockall() | CAP_IPC_LOCK or RLIMIT_MEMLOCK | Skip memory locking
Real-time scheduling | CAP_SYS_NICE | Use normal priorities
Hardware perf counters | CAP_PERFMON or perf_event_paranoid <= 1 | Use software counters
Explicit huge pages | hugetlb group membership | Use transparent huge pages
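The perf-counter row can be checked at runtime (a sketch; the threshold follows the table above, and a missing sysctl file is treated as no perf support at all):

```shell
# Decide between hardware and software counters from perf_event_paranoid.
paranoid_file=/proc/sys/kernel/perf_event_paranoid
if [ -r "$paranoid_file" ]; then
    level=$(cat "$paranoid_file")
    if [ "$level" -le 1 ]; then
        echo "hardware counters available (paranoid=$level)"
    else
        echo "falling back to software counters (paranoid=$level)"
    fi
else
    echo "perf events not supported by this kernel"
fi
```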

Sysctl Settings

These kernel parameters affect NUMA behavior:

# View current settings
sysctl vm.zone_reclaim_mode
sysctl kernel.numa_balancing
sysctl kernel.perf_event_paranoid

Setting | Value | Reason
vm.zone_reclaim_mode | 0 | Prevents premature reclaim on the local node
kernel.numa_balancing | 0 | Prevents the kernel from migrating pages unexpectedly
kernel.perf_event_paranoid | 1 or lower | Enables perf counters for non-root users

Applying Settings

Temporarily:

sudo sysctl -w vm.zone_reclaim_mode=0
sudo sysctl -w kernel.numa_balancing=0

Permanently:

cat << 'EOF' | sudo tee /etc/sysctl.d/99-numa.conf
# NUMA optimization settings
vm.zone_reclaim_mode = 0
kernel.numa_balancing = 0
EOF

sudo sysctl --system

Transparent Huge Pages

# Check THP status
cat /sys/kernel/mm/transparent_hugepage/enabled

# Options: always, madvise, never
# Recommended: madvise (explicit opt-in per region)
echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

THP Recommendations

Workload | Setting | Reason
General server | madvise | Opt-in per allocation
Database | never | Predictable latency
HPC | always | Maximum throughput
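Whichever setting is chosen, actual THP usage can be observed from /proc/meminfo (a sketch; AnonHugePages counts anonymous memory currently backed by transparent huge pages):

```shell
# Show the active THP mode and how much anonymous memory is THP-backed.
thp=/sys/kernel/mm/transparent_hugepage/enabled
[ -r "$thp" ] && echo "mode: $(cat "$thp")"   # active mode is bracketed, e.g. [madvise]

# AnonHugePages: anonymous pages currently served by THP.
grep AnonHugePages /proc/meminfo
```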

Explicit Huge Pages (hugetlbfs)

For 2 MB huge pages:

# Check current huge page count
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Reserve huge pages at runtime
echo 1024 | sudo tee /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

# Reserve at boot (add to /etc/sysctl.conf)
vm.nr_hugepages = 1024
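Reservation can silently fall short on fragmented memory, so verify what the kernel actually allocated (a sketch; the `requested=1024` value is an assumption matching the echo above, and HugePages_Total in /proc/meminfo is the pool really reserved):

```shell
# Compare requested vs. actually reserved huge pages.
requested=1024   # assumption: matches the reservation above
reserved=$(awk '/^HugePages_Total:/ {print $2}' /proc/meminfo)

if [ "${reserved:-0}" -ge "$requested" ]; then
    echo "reserved $reserved huge pages (ok)"
else
    echo "only ${reserved:-0} of $requested reserved; memory may be fragmented"
fi
```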

Mounting hugetlbfs

# Mount hugetlbfs
sudo mount -t hugetlbfs none /mnt/hugepages

# Create a dedicated group and grant it access
sudo groupadd hugetlb
sudo chown root:hugetlb /mnt/hugepages
sudo chmod 1770 /mnt/hugepages

# Add users to the hugetlb group (takes effect on next login)
sudo usermod -aG hugetlb $USER

Persistent mount (fstab)

none /mnt/hugepages hugetlbfs defaults,gid=hugetlb,mode=1770 0 0

Hard Mode Checklist

Before enabling hard mode, verify:

  • [ ] Kernel version 5.4+
  • [ ] CAP_SYS_ADMIN or root access (for MPOL_BIND)
  • [ ] CAP_SYS_NICE (for strict pinning)
  • [ ] kernel.numa_balancing=0
  • [ ] vm.zone_reclaim_mode=0
  • [ ] Sufficient huge pages reserved (if using explicit huge pages)
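The checklist can be scripted as a preflight (a sketch combining checks from earlier sections; the capability bit numbers are the standard Linux values, and `sysctl` is assumed to be available):

```shell
#!/bin/sh
# Hard-mode preflight: print PASS/FAIL for each checklist item.
check() {  # check <description> <command...>
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then echo "PASS  $desc"; else echo "FAIL  $desc"; fi
}

sysctl_is() { [ "$(sysctl -n "$1" 2>/dev/null)" = "$2" ]; }
have_cap()  { [ $(( (0x$(awk '/^CapEff:/ {print $2}' /proc/self/status) >> $1) & 1 )) -eq 1 ]; }

check "kernel.numa_balancing = 0"   sysctl_is kernel.numa_balancing 0
check "vm.zone_reclaim_mode = 0"    sysctl_is vm.zone_reclaim_mode 0
check "CAP_SYS_ADMIN (MPOL_BIND)"   have_cap 21   # bit 21 = CAP_SYS_ADMIN
check "CAP_SYS_NICE (strict pin)"   have_cap 23   # bit 23 = CAP_SYS_NICE
```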

Kernel Configuration Options

For custom kernels, ensure these options are enabled:

CONFIG_NUMA=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=n  # Disable by default
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_HUGETLBFS=y
CONFIG_CPUSETS=y
CONFIG_CGROUP_CPUACCT=y

Verifying NUMA Support

# Check kernel config (requires CONFIG_IKCONFIG_PROC; not all distros enable it)
zcat /proc/config.gz | grep -i numa

# Should show:
# CONFIG_NUMA=y
# CONFIG_NUMA_BALANCING=y

# Check boot parameters
grep numa /proc/cmdline
# If "numa=off" appears, NUMA is disabled

Platform-Specific Notes

Intel

  • SNC (Sub-NUMA Clustering): Enable in BIOS for more NUMA nodes per socket
  • Memory Interleaving: Disable in BIOS for proper NUMA topology

AMD EPYC

  • NPS (NUMA Per Socket): Configure in BIOS for NUMA granularity
  • CCX/CCD awareness: Each compute complex may be a NUMA node

AWS EC2

  • Metal instances provide full NUMA access
  • Check instance documentation for NUMA support

Containers

# Docker: check cgroup NUMA restrictions
# cgroup v1:
cat /sys/fs/cgroup/cpuset/cpuset.mems
# cgroup v2 (unified hierarchy):
cat /sys/fs/cgroup/cpuset.mems.effective

# Kubernetes: use the topology manager
# kubelet --topology-manager-policy=single-numa-node

See Also