CoREB: Beyond Retrieval

A Multitask Benchmark and Model for Code Search

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

* Work done at Alipay. ‡ Corresponding author.

About CoREB

Code search now underpins not only developer-facing tools but also modern AI coding agents (e.g., SWE-agent, OpenHands, Cursor). Yet existing benchmarks evaluate only the embedding stage, ignoring the reranker and developer-style queries that production pipelines actually use, and additionally suffer from data contamination, label noise, and degenerate binary relevance.

CoREB is a contamination-limited, multitask Code Retrieval and Reranking Benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. It is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark twelve embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code.
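The two-stage pipeline the benchmark targets, a fast embedding retriever followed by a more expensive reranker over its candidates, can be sketched in a few lines. This is an illustrative sketch only: `cosine`, `retrieve`, `rerank`, and the scoring functions are hypothetical names, not CoREB or model APIs.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two dense embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def retrieve(query_vec, corpus_vecs, k=128):
    # Stage 1 (embedding retrieval): score every corpus item against the
    # query vector and keep the top-k candidate ids. Production systems
    # would use an ANN index instead of a full scan.
    ranked = sorted(corpus_vecs,
                    key=lambda doc_id: cosine(query_vec, corpus_vecs[doc_id]),
                    reverse=True)
    return ranked[:k]

def rerank(query, candidate_ids, cross_score, k=10):
    # Stage 2 (reranking): a cross-encoder rescores only the retrieved
    # candidates jointly with the query, producing the final top-k.
    return sorted(candidate_ids,
                  key=lambda doc_id: cross_score(query, doc_id),
                  reverse=True)[:k]
```

The split matters for evaluation: the retriever bounds what the reranker can ever surface (Recall), while the reranker decides the final ordering (nDCG).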

Key findings:

- C2LLM-7B is the strongest embedding model overall.
- No off-the-shelf reranker is net-positive across all three tasks; several degrade code-to-text sharply.
- Our fine-tuned CoREB-Reranker is the only reranker that improves text-to-code, code-to-text, and code-to-code simultaneously.

Main Results: Per-Task Retrieval

Retrieval on CoREB releases v202602 and v202603 (graded qrels, relevance_level=2). nDCG and Recall denote nDCG@10 and Recall@10. Overall is a query-count-weighted average across tasks. Hard negatives (rel=1) penalize nDCG but do not count toward Recall.

Release v202602:

| Model | Category | T2C nDCG | T2C Recall | C2T nDCG | C2T Recall | C2C nDCG | C2C Recall | Overall nDCG | Overall Recall |
|---|---|---|---|---|---|---|---|---|---|
| C2LLM-7B | CODE | **0.435** | **0.765** | **0.822** | **0.838** | 0.661 | 0.998 | **0.639** | **0.815** |
| C2LLM-0.5B | CODE | 0.429 | 0.713 | 0.800 | 0.833 | 0.664 | 0.978 | 0.625 | 0.789 |
| GemEmb-2 | CLOSED | 0.420 | 0.755 | 0.764 | **0.838** | **0.709** | **1.000** | 0.607 | 0.811 |
| Jina-code-emb-1.5b | CODE | 0.405 | 0.713 | 0.767 | 0.827 | 0.686 | 0.976 | 0.600 | 0.786 |
| F2LLM-4B | GENERAL | 0.400 | 0.694 | 0.788 | 0.825 | 0.515 | 0.839 | 0.597 | 0.767 |
| Jina-code-emb-0.5b | CODE | 0.397 | 0.679 | 0.742 | 0.808 | 0.699 | 0.980 | 0.585 | 0.761 |
| Jina-emb-v4 | GENERAL | 0.398 | 0.676 | 0.756 | 0.805 | 0.546 | 0.895 | 0.583 | 0.753 |
| Qwen3-Emb-4B | GENERAL | 0.399 | 0.651 | 0.763 | 0.813 | 0.386 | 0.713 | 0.576 | 0.734 |
| F2LLM-1.7B | GENERAL | 0.377 | 0.625 | 0.761 | 0.819 | 0.408 | 0.668 | 0.567 | 0.722 |
| F2LLM-0.6B | GENERAL | 0.348 | 0.583 | 0.742 | 0.795 | 0.336 | 0.576 | 0.540 | 0.686 |
| Qwen3-Emb-8B | GENERAL | 0.341 | 0.551 | 0.726 | 0.786 | 0.299 | 0.535 | 0.526 | 0.664 |
| Qwen3-Emb-0.6B | GENERAL | 0.350 | 0.579 | 0.665 | 0.757 | 0.419 | 0.719 | 0.508 | 0.675 |

Release v202603:

| Model | Category | T2C nDCG | T2C Recall | C2T nDCG | C2T Recall | C2C nDCG | C2C Recall | Overall nDCG | Overall Recall |
|---|---|---|---|---|---|---|---|---|---|
| C2LLM-7B | CODE | **0.443** | **0.753** | **0.766** | **0.812** | 0.659 | **0.997** | **0.615** | **0.806** |
| C2LLM-0.5B | CODE | 0.430 | 0.716 | 0.725 | 0.810 | 0.656 | 0.970 | 0.591 | 0.786 |
| Jina-code-emb-1.5b | CODE | 0.414 | 0.705 | 0.735 | 0.806 | 0.671 | 0.973 | 0.590 | 0.780 |
| Jina-code-emb-0.5b | CODE | 0.386 | 0.650 | 0.725 | 0.792 | **0.677** | 0.963 | 0.574 | 0.749 |
| F2LLM-4B | GENERAL | 0.407 | 0.695 | 0.735 | 0.808 | 0.500 | 0.766 | 0.568 | 0.755 |
| Qwen3-Emb-4B | GENERAL | 0.390 | 0.626 | 0.704 | 0.800 | 0.392 | 0.603 | 0.535 | 0.704 |
| F2LLM-1.7B | GENERAL | 0.383 | 0.603 | 0.690 | 0.776 | 0.383 | 0.562 | 0.525 | 0.679 |
| F2LLM-0.6B | GENERAL | 0.344 | 0.545 | 0.641 | 0.762 | 0.334 | 0.491 | 0.480 | 0.640 |
| Qwen3-Emb-8B | GENERAL | 0.328 | 0.521 | 0.635 | 0.752 | 0.320 | 0.450 | 0.469 | 0.620 |
| Qwen3-Emb-0.6B | GENERAL | 0.349 | 0.541 | 0.597 | 0.731 | 0.384 | 0.551 | 0.467 | 0.630 |

Bold marks the per-column best within each release. C2LLM-7B is the strongest model overall. GemEmb-2 is a closed-source API model; it leads on code-to-code nDCG and on code-to-text and code-to-code Recall.
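The caption's claim that hard negatives (rel=1) hurt nDCG but not Recall can be made concrete with a short sketch. The 2^rel − 1 gain and log2 rank discount below are the standard graded-nDCG formulation and are an assumption here; this page does not specify CoREB's exact variant.

```python
import math

def dcg_at_k(rels, k=10):
    # Graded DCG: gain 2^rel - 1, discounted by log2 of the (1-indexed) rank + 1.
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg_at_k(ranked_ids, qrels, k=10):
    rels = [qrels.get(d, 0) for d in ranked_ids]
    idcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(rels, k) / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_ids, qrels, k=10, relevance_level=2):
    # Only judgments at or above relevance_level count as relevant hits.
    gold = {d for d, r in qrels.items() if r >= relevance_level}
    return len(gold & set(ranked_ids[:k])) / len(gold) if gold else 0.0

# Hypothetical query: one gold solution (rel=2) and one hard negative (rel=1).
qrels = {"gold": 2, "hard_neg": 1, "other": 0}
run_a = ["gold", "hard_neg", "other"]   # gold ranked first
run_b = ["hard_neg", "gold", "other"]   # hard negative outranks the gold item

# Recall@10 is identical: the hard negative never counts as relevant ...
assert recall_at_k(run_a, qrels) == recall_at_k(run_b, qrels) == 1.0
# ... but nDCG@10 drops when the hard negative is ranked above the gold item.
assert ndcg_at_k(run_a, qrels) > ndcg_at_k(run_b, qrels)
```

This is why graded qrels are more discriminative than binary relevance: two systems with identical Recall can still be separated by how they order gold items against hard negatives.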

Reranker Evaluation

No off-the-shelf reranker is net-positive across all three tasks. When reranking the top-128 candidates on v202603, Jina Reranker v2 collapses code-to-text by −22.4%, and a 12-point swing separates the best and worst baselines on code-to-code. Our fine-tuned CoREB-Reranker (Qwen3-Reranker-4B + LoRA, trained on 3.1M samples from v202602) is the only reranker that achieves consistent gains across all three tasks.
| Reranker | Δ T2C | Δ C2T | Δ C2C | Net-positive? |
|---|---|---|---|---|
| Jina Reranker v2 | −8.3% | −22.4% | −8.8% | No |
| Jina Reranker v3 | −2.2% | −5.0% | −0.1% | No |
| Qwen3-Reranker-0.6B | −0.6% | −8.2% | −2.3% | No |
| Qwen3-Reranker-4B | −0.1% | −3.2% | +3.3% | No |
| CoREB-Reranker | +1.1% | +0.8% | +5.1% | Yes |

Δ nDCG@10 (%) after reranking top-128 on v202603 (train-on-v202602, test-on-v202603). CoREB-Reranker is fine-tuned from Qwen3-Reranker-4B via LoRA — available at hq-bench/coreb-code-reranker.
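The Δ values above follow the usual relative-change convention: rerank only the first-stage top-128, then compare nDCG@10 before and after. A minimal sketch of that loop, assuming the standard graded nDCG@10; the function names and the toy reranker below are hypothetical, not CoREB's evaluation code:

```python
import math

def ndcg10(ranked, qrels):
    # Graded nDCG@10 with the common 2^rel - 1 gain and log2 discount.
    dcg = sum((2 ** qrels.get(d, 0) - 1) / math.log2(i + 2)
              for i, d in enumerate(ranked[:10]))
    ideal = sorted(qrels.values(), reverse=True)[:10]
    idcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

def delta_ndcg_pct(first_stage, reranker_score, qrels, depth=128):
    # Rerank only the first-stage top-`depth`; deeper results stay untouched,
    # so the reranker can never recover what retrieval missed.
    head = sorted(first_stage[:depth], key=reranker_score, reverse=True)
    before = ndcg10(first_stage, qrels)
    after = ndcg10(head + first_stage[depth:], qrels)
    return 100.0 * (after - before) / before if before else 0.0

# A reranker that promotes a hard negative above the gold item yields a
# negative delta, matching the sign convention in the table above.
qrels = {"gold": 2, "hard_neg": 1, "other": 0}
bad_score = lambda doc_id: {"hard_neg": 1.0, "gold": 0.5, "other": 0.0}[doc_id]
assert delta_ndcg_pct(["gold", "hard_neg", "other"], bad_score, qrels) < 0
```

Because Recall@10 over the same candidate set is fixed by the retriever, Δ nDCG@10 isolates exactly the ordering quality the reranker adds or destroys.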


Citation

If you use CoREB in your research, please cite our paper:

@article{xue2026coreb,
  title     = {Beyond Retrieval: A Multitask Benchmark and Model for Code Search},
  author    = {Xue, Siqiao and Liao, Zihan and Qin, Jin and Zhang, Ziyin and Mu, Yixiang and Zhou, Fan and Yu, Hang},
  journal   = {arXiv preprint arXiv:2605.04615},
  year      = {2026},
  url       = {https://arxiv.org/abs/2605.04615}
}