A/B Testing¶
Statistically compare configurations with confidence.
Goal¶
By the end of this tutorial, you will:
- Design proper A/B experiments
- Run statistically valid comparisons
- Interpret statistical significance
- Make data-driven decisions
Time: 30-45 minutes
Prerequisites¶
- Completed Parameter Sweeps
- Basic understanding of statistics (helpful)
Step 1: Why A/B Testing?¶
A single run of each configuration can be misleading:
Run 1: Config A = 950, Config B = 980
→ B is better? Maybe just random variation.
Run 10 each:
Config A: 950, 962, 948, 971, 955... (mean: 958)
Config B: 980, 945, 968, 952, 975... (mean: 964)
→ Is the 6-point difference real or noise?
A/B testing answers: Is the difference statistically significant?
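To see how often a single-run comparison can mislead, the sketch below simulates two identical configurations and counts how often one "beats" the other by 30 tasks/hr or more, the gap seen in the Run 1 example. The run-to-run standard deviation of 28 tasks/hr is an assumption borrowed from the Step 2 output, and the normally distributed noise is a simplification.

```python
import random

def single_run_false_win_rate(mean=950.0, sd=28.0, margin=30.0,
                              trials=10_000, seed=0):
    """Fraction of trials in which config B beats config A by at least
    `margin` in a single run, even though both share the same true mean."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        a = rng.gauss(mean, sd)  # one run of config A
        b = rng.gauss(mean, sd)  # one run of an identical config B
        if b - a >= margin:
            wins += 1
    return wins / trials

rate = single_run_false_win_rate()
print(f"B 'wins' by 30+ tasks/hr in {rate:.0%} of single-run comparisons")
```

Under these assumptions, roughly one comparison in five shows a 30-point "improvement" that does not exist.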
Step 2: Basic A/B Test¶
Compare two configurations:
waremax ab-test scenario.yaml \
--control "policies.task_allocation=nearest_idle" \
--treatment "policies.task_allocation=least_busy" \
--runs 20
Output:
=== A/B Test Results ===
Control (nearest_idle):
Throughput: 952 ± 28 tasks/hr
Runs: 20
Treatment (least_busy):
Throughput: 938 ± 32 tasks/hr
Runs: 20
Comparison:
Difference: -14 tasks/hr (-1.5%)
95% CI: [-32, +4]
p-value: 0.12
Conclusion: No significant difference (p > 0.05)
The observed difference is likely due to random variation.
Step 3: Interpret Results¶
Key Statistics¶
| Statistic | Meaning |
|---|---|
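| Mean ± std | Sample mean and standard deviation across runs |
| Difference | Treatment - Control |
| 95% CI | Range of plausible values for the true difference |
| p-value | Probability of seeing a difference at least this large if the two configurations truly performed the same |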
Decision Rules¶
p < 0.05: Significant difference (reject null hypothesis)
p ≥ 0.05: No significant difference (cannot reject null)
95% CI:
- Excludes 0: Significant
- Includes 0: Not significant
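The statistics above can be approximated by hand from the summary numbers in Step 2. The sketch below is a minimal illustration using a normal approximation to the two-sample comparison; the test that `waremax ab-test` actually runs is not documented here, so the function name and method are assumptions and the results will differ slightly from the tool's output.

```python
from math import sqrt
from statistics import NormalDist

def compare_from_summary(mean_c, sd_c, n_c, mean_t, sd_t, n_t, alpha=0.05):
    """Difference, confidence interval, and two-sided p-value
    (normal approximation, unequal variances allowed)."""
    diff = mean_t - mean_c
    se = sqrt(sd_c**2 / n_c + sd_t**2 / n_t)  # standard error of the difference
    std_normal = NormalDist()
    z = diff / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, ci, p_value

# Summary statistics from the Step 2 output.
diff, ci, p = compare_from_summary(952, 28, 20, 938, 32, 20)

significant_by_p = p < 0.05
significant_by_ci = not (ci[0] <= 0 <= ci[1])

print(f"Difference: {diff:+.0f}, 95% CI: [{ci[0]:.0f}, {ci[1]:.0f}], p = {p:.2f}")
print(f"Significant by p-value: {significant_by_p}, by CI: {significant_by_ci}")
```

This gives a p-value near 0.14 and an interval of roughly [-33, +5], close to the tool's 0.12 and [-32, +4]. The two decision rules agree because they are two views of the same test.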
Step 4: Design Good Experiments¶
Sample Size¶
More runs narrow the confidence interval, so smaller differences become detectable:
# Minimum viable
waremax ab-test ... --runs 10
# Standard
waremax ab-test ... --runs 20
# High confidence
waremax ab-test ... --runs 50
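As a rough guide to what each run count buys, the sketch below estimates the half-width of the 95% confidence interval for the difference between two groups. It assumes a run-to-run standard deviation of 30 tasks/hr in both groups (similar to the Step 2 output) and uses a normal approximation, which is slightly optimistic for small run counts.

```python
from math import sqrt
from statistics import NormalDist

def ci_half_width(sd, runs_per_group, confidence=0.95):
    """Approximate half-width of the CI for a difference of two means
    when both groups share the same standard deviation and run count."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd * sqrt(2 / runs_per_group)

half_widths = {n: ci_half_width(30, n) for n in (10, 20, 50)}
for n, hw in half_widths.items():
    print(f"{n:>3} runs per group: difference known to within about ±{hw:.0f} tasks/hr")
```

The interval shrinks with the square root of the run count, so halving the uncertainty takes about four times as many runs.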
Power Analysis¶
Calculate needed sample size:
# --expected-effect 0.05: look for a 5% difference
# --power 0.8: 80% chance to detect it
waremax ab-test ... \
--power-analysis \
--expected-effect 0.05 \
--power 0.8
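The calculation behind a power analysis can be sketched with the standard normal-approximation formula for comparing two means. This is not necessarily what `--power-analysis` computes; the baseline mean of 952 tasks/hr and standard deviation of 30 are assumptions taken from earlier sample output, and `runs_per_group` is a hypothetical helper.

```python
from math import ceil
from statistics import NormalDist

def runs_per_group(baseline_mean, sd, relative_effect, power=0.8, alpha=0.05):
    """Runs needed per group to detect a relative change in the mean
    with a two-sided test (normal approximation)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = std_normal.inv_cdf(power)          # desired detection rate
    delta = baseline_mean * relative_effect      # absolute effect to detect
    n = 2 * ((z_alpha + z_power) * sd / delta) ** 2
    return ceil(n)

n_5pct = runs_per_group(952, 30, 0.05)
n_1pct = runs_per_group(952, 30, 0.01)
print(f"Detect a 5% change: {n_5pct} runs per group")
print(f"Detect a 1% change: {n_1pct} runs per group")
```

Under these assumptions a 5% effect needs about 7 runs per group while a 1% effect needs about 156, because the required sample size grows with the inverse square of the effect. For very small counts the approximation underestimates, so treat the result as a lower bound.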
Step 5: Multiple Metrics¶
Test multiple outcomes:
waremax ab-test scenario.yaml \
--control "routing.policy=shortest_path" \
--treatment "routing.policy=congestion_aware" \
--metrics throughput,avg_task_time,wait_time \
--runs 30
Output:
=== A/B Test Results (Multiple Metrics) ===
Metric Control Treatment Diff p-value Sig?
───────────────────────────────────────────────────────────────
Throughput 952 ± 28 985 ± 25 +3.5% 0.002 ✓
Avg task time 48.2 ± 3.1 45.8 ± 2.8 -5.0% 0.008 ✓
Wait time 8.5 ± 1.2 6.2 ± 1.0 -27.0% <0.001 ✓
Summary: Treatment (congestion_aware) significantly better
on all measured metrics.
Step 6: Handle Multiple Comparisons¶
Each additional metric is another chance for a false positive, so adjust the significance threshold when testing several:
# --correction bonferroni adjusts the significance threshold
# for the number of metrics tested
waremax ab-test scenario.yaml \
--control "config_a" \
--treatment "config_b" \
--metrics throughput,task_time,utilization,wait_time \
--correction bonferroni \
--runs 30
Bonferroni correction:
Adjusted α = 0.05 / 4 = 0.0125
Metric p-value Adjusted sig?
Throughput 0.02 No (0.02 > 0.0125)
Task time 0.008 Yes
Utilization 0.15 No
Wait time 0.001 Yes
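The Bonferroni rule is simple enough to apply by hand. The sketch below reproduces the table above from its four p-values; the `bonferroni` function is a hypothetical helper, not part of waremax.

```python
def bonferroni(p_values, alpha=0.05):
    """Return the adjusted threshold and a per-metric significance verdict.

    Each p-value is compared against alpha divided by the number of tests.
    """
    threshold = alpha / len(p_values)
    verdicts = {name: p < threshold for name, p in p_values.items()}
    return threshold, verdicts

# p-values from the table above.
p_values = {
    "throughput": 0.02,
    "task_time": 0.008,
    "utilization": 0.15,
    "wait_time": 0.001,
}

threshold, verdicts = bonferroni(p_values)
print(f"Adjusted alpha = {threshold}")
for name, significant in verdicts.items():
    print(f"{name:<12} p={p_values[name]:<6} significant: {significant}")
```

Throughput would pass the unadjusted 0.05 threshold but fails the adjusted one, which is the trade-off: fewer false positives at the cost of some sensitivity.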
Step 7: Sequential Testing¶
Stop early when the result is clear:
waremax ab-test scenario.yaml \
--control "old_config" \
--treatment "new_config" \
--sequential \
--max-runs 100 \
--early-stopping
Output:
Sequential A/B Test:
Run 10: Inconclusive (continue)
Run 20: Inconclusive (continue)
Run 30: Treatment winning (p=0.08, continue)
Run 40: Treatment significantly better (p=0.02)
Stopped early at run 40 (max was 100)
Saved 60% of runs while maintaining statistical validity.
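Early stopping only stays valid if the stopping rule accounts for the repeated looks. The stopping rule behind `--sequential` is not described here, so the sketch below illustrates the general problem instead: it simulates experiments with no true difference, checks them every 10 runs, and compares a naive p < 0.05 check at each look against a deliberately conservative threshold of 0.05 divided by the number of looks. The unit-variance noise and the known-variance z-test are simplifying assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(per_look_alpha, looks=range(10, 101, 10),
                        experiments=2000, seed=1):
    """Simulate A/A tests (no true difference) and return the share that
    cross the significance threshold at any of the scheduled looks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    look_set = set(looks)
    max_runs = max(look_set)
    false_positives = 0
    for _ in range(experiments):
        sum_diff = 0.0
        for n in range(1, max_runs + 1):
            # Difference between one treatment run and one control run.
            sum_diff += rng.gauss(0, 1) - rng.gauss(0, 1)
            if n in look_set:
                z = (sum_diff / n) / sqrt(2 / n)  # known-variance z statistic
                if abs(z) > z_crit:
                    false_positives += 1  # stopped early on pure noise
                    break
    return false_positives / experiments

naive = false_positive_rate(per_look_alpha=0.05)
corrected = false_positive_rate(per_look_alpha=0.05 / 10)
print(f"Naive peeking at 10 looks:   {naive:.1%} false positives")
print(f"Threshold divided by looks:  {corrected:.1%} false positives")
```

Checking at p < 0.05 on every look pushes the overall false positive rate well above 5%, while the stricter per-look threshold keeps it below. Dedicated sequential methods aim for the same guarantee with less loss of sensitivity.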
Step 8: Real-World Example¶
Compare routing policies:
# Define configurations
cat > control.yaml << EOF
routing:
  policy: shortest_path
EOF
cat > treatment.yaml << EOF
routing:
  policy: congestion_aware
  congestion_weight: 1.5
EOF
# Run A/B test
waremax ab-test base_scenario.yaml \
--control control.yaml \
--treatment treatment.yaml \
--metrics throughput,p95_task_time \
--runs 30 \
-o ab_results/
Detailed analysis:
=== Detailed A/B Analysis ===
Throughput:
Control: 952 ± 28 (min: 901, max: 1008)
Treatment: 998 ± 24 (min: 955, max: 1042)
Effect size (Cohen's d): 1.76 (large)
95% CI for difference: [28, 64]
p-value: 0.0003
Distribution overlap: 32%
Probability treatment > control: 89%
Recommendation: Strong evidence that congestion_aware
routing improves throughput by 4-5%.
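Two of the figures in this analysis can be estimated from the group means and standard deviations alone. The sketch below uses the common pooled-standard-deviation form of Cohen's d and a normal model for the probability that a treatment run beats a control run. Both function names are hypothetical, and the tool's exact definitions are not given here, so its reported values may differ.

```python
from math import sqrt
from statistics import NormalDist

def cohens_d(mean_c, sd_c, mean_t, sd_t):
    """Standardized mean difference using the pooled standard deviation
    (equal group sizes assumed)."""
    pooled_sd = sqrt((sd_c**2 + sd_t**2) / 2)
    return (mean_t - mean_c) / pooled_sd

def prob_treatment_beats_control(mean_c, sd_c, mean_t, sd_t):
    """P(one treatment run > one control run), assuming both are normal
    and independent, so their difference is normal too."""
    diff_dist = NormalDist(mean_t - mean_c, sqrt(sd_c**2 + sd_t**2))
    return 1 - diff_dist.cdf(0)

# Throughput summary from the detailed analysis above.
d = cohens_d(952, 28, 998, 24)
p_sup = prob_treatment_beats_control(952, 28, 998, 24)
relative_gain = (998 - 952) / 952

print(f"Cohen's d: {d:.2f}")
print(f"Probability treatment > control: {p_sup:.0%}")
print(f"Relative improvement: {relative_gain:.1%}")
```

The probability comes out near the 89% shown above, and the relative improvement of about 4.8% matches the 4-5% in the recommendation.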
Step 9: Document Results¶
Create an A/B test report:
# A/B Test Report: Routing Policy
## Hypothesis
Congestion-aware routing improves throughput compared
to shortest-path routing.
## Setup
- Base scenario: standard.yaml
- Control: shortest_path routing
- Treatment: congestion_aware routing (weight=1.5)
- Runs: 30 per group
- Duration: 1 hour per run
## Results
| Metric | Control | Treatment | Change | p-value |
|--------|---------|-----------|--------|---------|
| Throughput | 952±28 | 998±24 | +4.8% | 0.0003 |
| P95 time | 85±8 | 72±6 | -15.3% | <0.001 |
| Wait time | 8.5±1.2 | 5.8±0.9 | -31.8% | <0.001 |
## Conclusion
Congestion-aware routing significantly improves all
measured metrics. Recommend adoption.
## Next Steps
- Test with higher traffic loads
- Tune congestion_weight parameter
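If you write reports like this regularly, the results table can be generated rather than typed. The sketch below is one possible helper, with the row data entered by hand from the report above. The format in which waremax writes its results is not covered here, so nothing is parsed from the `ab_results/` directory.

```python
def pct_change(control, treatment):
    """Relative change of treatment versus control, in percent."""
    return (treatment - control) / control * 100

def results_table(rows):
    """Render (metric, (mean, sd), (mean, sd), p-value text) rows
    as a Markdown table."""
    lines = [
        "| Metric | Control | Treatment | Change | p-value |",
        "|--------|---------|-----------|--------|---------|",
    ]
    for name, (c_mean, c_sd), (t_mean, t_sd), p_text in rows:
        change = pct_change(c_mean, t_mean)
        lines.append(
            f"| {name} | {c_mean:g}±{c_sd:g} | {t_mean:g}±{t_sd:g} "
            f"| {change:+.1f}% | {p_text} |"
        )
    return "\n".join(lines)

rows = [
    ("Throughput", (952, 28), (998, 24), "0.0003"),
    ("P95 time", (85, 8), (72, 6), "<0.001"),
    ("Wait time", (8.5, 1.2), (5.8, 0.9), "<0.001"),
]
table = results_table(rows)
print(table)
```

Computing the Change column from the means avoids transcription errors between the raw numbers and the percentages.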
Common Pitfalls¶
Not Enough Runs¶
Peeking at Results¶
Bad: Stop as soon as the result looks good
Good: Fix the sample size in advance and stick to it, or use sequential testing (Step 7)
Ignoring Variance¶
Multiple Testing¶
Next Steps¶
- Benchmarking: Performance limits
- A/B Test Command Reference
- Parameter Sweeps: Explore more configurations