Col-Bandit

Adaptive Pruning for Late-Interaction Retrieval
via Multi-Armed Bandit Theory

Roi Pony, Adi Raz, Oshri Naparstek, Idan Friedman, Udi Barzelay
IBM Research Israel

Multi-vector late-interaction retrievers such as ColBERT achieve state-of-the-art retrieval quality, but their query-time cost is dominated by exhaustively computing token-level interactions (MaxSim) for every candidate document. We introduce Col-Bandit, a query-time pruning algorithm that casts reranking as a finite-population Top-K identification problem. Col-Bandit maintains uncertainty-aware bounds over partially observed document scores and adaptively reveals only the entries needed to determine the top results. It operates as a zero-shot, drop-in layer requiring no index modifications or retraining.

Paper · Code (coming soon)
Background and Motivation

How ColBERT Scores Documents

Walk through the late-interaction scoring step by step. Each cell in the score matrix is one MaxSim operation — and Col-Bandit's target for pruning.
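The scoring step can be sketched in a few lines of NumPy. This is a minimal illustration of MaxSim late-interaction scoring, not the paper's implementation; the shapes, document lengths, and helper names here are all illustrative assumptions.

```python
import numpy as np

def maxsim_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """Score one document: for each query token, take the max similarity
    over the document's tokens, then sum over query tokens (MaxSim)."""
    # query_embs: (T, d), doc_embs: (L, d); rows assumed L2-normalized
    sims = query_embs @ doc_embs.T          # (T, L) token-token similarities
    return float(sims.max(axis=1).sum())    # max per query token, summed

def unit(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
Q = unit(rng.normal(size=(32, 128)))        # 32 query tokens, dim 128
docs = [unit(rng.normal(size=(rng.integers(50, 200), 128)))
        for _ in range(1000)]               # 1000 candidate documents

# Exhaustive reranking fills the full N x T score matrix:
# 1000 documents x 32 query tokens = 32,000 MaxSim entries.
scores = [maxsim_score(Q, D) for D in docs]
top5 = np.argsort(scores)[::-1][:5]
```

Every cell of the N × T matrix is computed here regardless of how uncompetitive a document is; that exhaustiveness is what Col-Bandit removes.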

How It Works

Col-Bandit Algorithm

Col-Bandit reframes late-interaction reranking as a top-K identification problem, using confidence bounds to adaptively prune the score matrix at query time.

Standard ColBERT

Computes all N × T MaxSim values. For 1000 documents and 32 query tokens, that's 32,000 MaxSim computations — even for documents clearly not in the top-K.

SCORE MATRIX H [N×T]
100% of cells computed

Col-Bandit

Treats each document as a bandit arm. Samples tokens adaptively via LUCB — focusing on competitive candidates and ignoring obvious losers. Stops when top-K is separated.

SCORE MATRIX H [N×T]
~16% of cells → same top-K
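The adaptive procedure can be sketched as a finite-population LUCB loop: each document is an arm, each "pull" reveals one cell of the score matrix, and sampling concentrates on the least-certain current winner and the most-threatening current loser until the top-K is separated. This is a simplified, hedged sketch in the spirit of the algorithm, not the paper's code; the confidence radius, `beta`, and the reveal order are illustrative assumptions.

```python
import numpy as np

def lucb_topk(H: np.ndarray, k: int, beta: float = 0.5):
    """Identify the top-k rows of H by revealing cells adaptively.
    H is the (N, T) MaxSim matrix we pretend is initially unknown.
    Returns (identified top-k index set, number of cells revealed)."""
    N, T = H.shape
    revealed = [[] for _ in range(N)]                 # observed cells per doc
    order = [np.random.permutation(T) for _ in range(N)]

    def pull(i):                                      # reveal one new cell
        revealed[i].append(H[i, order[i][len(revealed[i])]])

    for i in range(N):                                # one cell per document
        pull(i)
    pulls = N

    while True:
        means = np.array([np.mean(r) for r in revealed])
        n = np.array([len(r) for r in revealed])
        # Confidence radius; exactly zero once a row is fully revealed,
        # since the population of T cells is then known (finite population).
        rad = np.where(n < T, beta * np.sqrt(np.log(1 + pulls) / n), 0.0)
        lower, upper = (means - rad) * T, (means + rad) * T

        ranked = np.argsort(means)[::-1]
        top, rest = ranked[:k], ranked[k:]
        weakest = top[np.argmin(lower[top])]          # least-certain winner
        strongest = rest[np.argmax(upper[rest])]      # strongest challenger
        if lower[weakest] >= upper[strongest]:        # top-k separated: stop
            return set(top.tolist()), pulls
        for i in (weakest, strongest):                # LUCB: sample both
            if len(revealed[i]) < T:
                pull(i)
                pulls += 1

np.random.seed(0)
H = np.random.default_rng(1).random((200, 32))        # toy score matrix
topk, cost = lucb_topk(H, k=5)
fraction = cost / H.size                              # cells actually revealed
```

With a heuristic `beta` the returned set is approximate rather than guaranteed, which mirrors the Overlap@K-vs-compute trade-off reported below; tightening the bounds buys fidelity at the cost of more revealed cells.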
Experiments

Results

Evaluated on 5 BEIR text datasets (ColBERTv2, Jina-ColBERT-v2) and 4 REAL-MM-RAG multimodal datasets (Granite Vision Embedding 3.2).

Overlap@K measures ranking fidelity: the fraction of documents in Col-Bandit's top-K that match full ColBERT's top-K. An Overlap@5 of 90% means that, on average, 4.5 of the 5 returned documents match the exact top-5.
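The metric itself is a one-liner; this tiny sketch (function name illustrative) makes the definition concrete.

```python
def overlap_at_k(pruned_topk, full_topk):
    """Fraction of the pruned ranker's top-K that appears in the
    full ranker's top-K (order-insensitive set overlap)."""
    return len(set(pruned_topk) & set(full_topk)) / len(full_topk)

# 4 of the 5 document IDs agree -> Overlap@5 = 0.8
print(overlap_at_k([3, 7, 9, 12, 40], [3, 7, 9, 12, 15]))  # -> 0.8
```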

Table 1: Col-Bandit Efficiency Across Models

Compute shows the percentage of score matrix cells that Col-Bandit must compute to achieve 90% and 95% Overlap@K. Savings shows the compute reduction vs. full ColBERT (100% compute).

| Model | Overlap@1 Compute (90% / 95%) | Overlap@1 Savings (90% / 95%) | Overlap@5 Compute (90% / 95%) | Overlap@5 Savings (90% / 95%) |
|---|---|---|---|---|
| ColBERTv2 (BEIR) | 13% / 14% | 7.7× / 7.1× | 28% / 33% | 3.6× / 3.1× |
| Jina-ColBERT-v2 (BEIR) | 11% / 14% | 9.1× / 7.1× | 26% / 34% | 3.8× / 3.0× |
| Granite Vision (REAL-MM-RAG) | 16% / 18% | 6.3× / 5.9× | 31% / 41% | 3.2× / 2.5× |
- 11-18% compute for Overlap@1 — range across models: 11-16% (90%), 14-18% (95%)
- 26-41% compute for Overlap@5 — range across models: 26-31% (90%), 33-41% (95%)
- 5.9×-9.1× savings for Overlap@1 — range across models: 6.3×-9.1× (90%), 5.9×-7.1× (95%)
- 2.5×-3.8× savings for Overlap@5 — range across models: 3.2×-3.8× (90%), 2.5×-3.1× (95%)
Interactive

Watch It Run

Replay an actual Col-Bandit execution step by step. The left shows ColBERT's full scoring; the right shows Col-Bandit's adaptive approach.

Reference

Citation

@misc{pony2026colbanditzeroshotquerytimepruning,
  title   = {Col-Bandit: Zero-Shot Query-Time Pruning
             for Late-Interaction Retrieval},
  author  = {Roi Pony and Adi Raz and Oshri Naparstek
             and Idan Friedman and Udi Barzelay},
  year    = {2026},
  eprint  = {2602.02827},
  archivePrefix = {arXiv},
  primaryClass  = {cs.IR},
  url     = {https://arxiv.org/abs/2602.02827}
}