Multi-Armed Bandit (UCB)¶

The UCB (Upper Confidence Bound) algorithm solves the exploration-exploitation dilemma: should we compare well-known entities or explore less-compared ones?

The Problem¶

With N entities, there are N*(N-1)/2 possible pairs. We want to find the best rankings with minimal comparisons by intelligently selecting which pairs to compare next.

How UCB Works¶

For each entity, UCB calculates a score that balances:

Exploitation: Entities with uncertain ratings (might be underrated)
Exploration: Entities with few comparisons (need more data)

UCB Score Formula¶

\[UCB_i = \bar{X}_i + c \times \sqrt{\frac{\ln(n)}{n_i}}\]

Where:

\(\bar{X}_i\) = Average reward (win rate) for entity i
\(c\) = Exploration constant (typically \(\sqrt{2}\) ≈ 1.414)
\(n\) = Total comparisons
\(n_i\) = Comparisons involving entity i

Entity Selection¶

Calculate UCB score for each entity
Select the entity with highest UCB score
Pair with the entity that maximizes information gain
After comparison, update statistics

Cold-Start Handling¶

New entities have \(n_i = 0\), which would make UCB undefined. Compere handles this by:

Assigning infinite UCB to entities with zero comparisons
Ensuring new entities are prioritized for their first comparison

if state.count == 0:
    ucb_scores[state.entity_id] = float('inf')  # Prioritize exploration
else:
    ucb_scores[state.entity_id] = state.value + c * sqrt(2 * log(n) / n_i)

Example¶

Entity A: 10 comparisons, 7 wins (70% win rate)
Entity B: 2 comparisons, 1 win (50% win rate)
Entity C: 0 comparisons

Total comparisons: 12

UCB_A = 0.70 + 1.414 * sqrt(ln(12) / 10) = 0.70 + 0.70 = 1.40
UCB_B = 0.50 + 1.414 * sqrt(ln(12) / 2)  = 0.50 + 1.57 = 2.07
UCB_C = ∞ (prioritized, no comparisons yet)

Entity C would be selected first (cold-start). If C is excluded, Entity B is selected despite lower win rate because of higher uncertainty.

Configuration¶

Variable	Default	Description
`UCB_EXPLORATION_CONSTANT`	`1.414`	Controls exploration vs exploitation
`UCB_UNEXPLORED_WEIGHT`	`1000.0`	Weight for entities with 0 comparisons

Weighted Random Selection¶

Standard UCB greedily selects the highest-scoring arm. Compere uses weighted random selection:

weights = []
for entity in entities:
    score = ucb_scores.get(entity.id, 0)
    if score == float('inf'):
        weights.append(1000.0)  # High weight for unexplored
    else:
        weights.append(max(score, 0.1))

entity1 = random.choices(entities, weights=weights, k=1)[0]

This adds variety while still favoring high-UCB entities, preventing repetitive comparisons.

Convergence¶

With random pairing, ranking N entities requires O(N²) comparisons. UCB reduces this to roughly O(N log N) by:

Quickly identifying top/bottom performers
Focusing comparisons on uncertain middle rankings
Avoiding redundant comparisons between established entities

When to Use UCB¶

Scenario	Use UCB?
Minimizing total comparisons	Yes
Rapid ranking convergence	Yes
New entities added frequently	Yes
Statistical validity required	No (use random)
Unbiased sampling needed	No (use random)