A Multitask Benchmark and Model for Code Search
Code search now underpins not only developer-facing tools but also modern AI coding agents (e.g., SWE-agent, OpenHands, Cursor). Yet existing benchmarks evaluate only the embedding stage, ignoring the reranking stage and the developer-style queries that production pipelines actually use, and they additionally suffer from data contamination, label noise, and degenerate binary relevance.
CoREB is a contamination-limited, multitask Code Retrieval and Reranking Benchmark, paired with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. It is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark twelve embedding models and five rerankers across three tasks: text-to-code (T2C), code-to-text (C2T), and code-to-code (C2C).
Key findings:
Embedding results on the v202602 release (graded qrels, relevance_level=2). nDCG and Recall denote nDCG@10 and Recall@10; Overall is the query-count-weighted average across tasks. Hard negatives (rel=1) penalize nDCG but do not count toward Recall.
| Model | Category | T2C nDCG | T2C Recall | C2T nDCG | C2T Recall | C2C nDCG | C2C Recall | Overall nDCG | Overall Recall |
|---|---|---|---|---|---|---|---|---|---|
| C2LLM-7B | CODE | **0.435** | **0.765** | **0.822** | **0.838** | 0.661 | 0.998 | **0.639** | **0.815** |
| C2LLM-0.5B | CODE | 0.429 | 0.713 | 0.800 | 0.833 | 0.664 | 0.978 | 0.625 | 0.789 |
| GemEmb-2 | CLOSED | 0.420 | 0.755 | 0.764 | **0.838** | **0.709** | **1.000** | 0.607 | 0.811 |
| Jina-code-emb-1.5b | CODE | 0.405 | 0.713 | 0.767 | 0.827 | 0.686 | 0.976 | 0.600 | 0.786 |
| F2LLM-4B | GENERAL | 0.400 | 0.694 | 0.788 | 0.825 | 0.515 | 0.839 | 0.597 | 0.767 |
| Jina-code-emb-0.5b | CODE | 0.397 | 0.679 | 0.742 | 0.808 | 0.699 | 0.980 | 0.585 | 0.761 |
| Jina-emb-v4 | GENERAL | 0.398 | 0.676 | 0.756 | 0.805 | 0.546 | 0.895 | 0.583 | 0.753 |
| Qwen3-Emb-4B | GENERAL | 0.399 | 0.651 | 0.763 | 0.813 | 0.386 | 0.713 | 0.576 | 0.734 |
| F2LLM-1.7B | GENERAL | 0.377 | 0.625 | 0.761 | 0.819 | 0.408 | 0.668 | 0.567 | 0.722 |
| F2LLM-0.6B | GENERAL | 0.348 | 0.583 | 0.742 | 0.795 | 0.336 | 0.576 | 0.540 | 0.686 |
| Qwen3-Emb-8B | GENERAL | 0.341 | 0.551 | 0.726 | 0.786 | 0.299 | 0.535 | 0.526 | 0.664 |
| Qwen3-Emb-0.6B | GENERAL | 0.350 | 0.579 | 0.665 | 0.757 | 0.419 | 0.719 | 0.508 | 0.675 |
Bold marks the per-column best. C2LLM-7B is the strongest model overall. GemEmb-2, the only closed-source API in the table, leads on C2C nDCG and C2C Recall and ties the best C2T Recall.
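To make the qrels convention above concrete, here is a minimal scoring sketch. It is an illustration rather than the official evaluation code, and it assumes a common pytrec_eval-style convention: nDCG@10 uses the graded labels with exponential gain (2^rel − 1), so hard negatives enter the nDCG computation, while Recall@10 treats only documents with rel >= relevance_level (2 here) as relevant. All ids and labels in the toy example are made up.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """Graded nDCG@k with exponential gain (2^rel - 1) and log2 rank discount."""
    def dcg(rels: List[int]) -> float:
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

def recall_at_k(ranked_ids: List[str], qrels: Dict[str, int],
                k: int = 10, relevance_level: int = 2) -> float:
    """Recall@k where only rel >= relevance_level counts as relevant (rel=1 is ignored)."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel >= relevance_level}
    return len(relevant & set(ranked_ids[:k])) / len(relevant) if relevant else 0.0

# Toy query: one true positive (rel=2) and one hard negative (rel=1); ids are made up.
qrels = {"solution_py": 2, "near_miss_py": 1}
run = ["near_miss_py", "solution_py", "unrelated_py"]  # retrieved ranking, best first
print(ndcg_at_k(run, qrels))    # < 1.0: the hard negative outranks the true positive
print(recall_at_k(run, qrels))  # 1.0: the hard negative never counts toward Recall
```

The Overall column then weights the per-task averages by query count, as noted in the caption.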
v202603: Jina Reranker v2 degrades code-to-text nDCG@10 by 22.4%, and a 12-point spread separates the best and worst baseline reranker on code-to-code (+3.3% vs. −8.8%). Our fine-tuned CoREB-Reranker (Qwen3-Reranker-4B + LoRA, trained on 3.1M samples from v202602) is the only reranker that achieves consistent gains across all three tasks.
| Reranker | Δ T2C | Δ C2T | Δ C2C | Net-Positive? |
|---|---|---|---|---|
| Jina Reranker v2 | −8.3% | −22.4% | −8.8% | No |
| Jina Reranker v3 | −2.2% | −5.0% | −0.1% | No |
| Qwen3-Reranker-0.6B | −0.6% | −8.2% | −2.3% | No |
| Qwen3-Reranker-4B | −0.1% | −3.2% | +3.3% | No |
| CoREB-Reranker | +1.1% | +0.8% | +5.1% | Yes |
Δ nDCG@10 (%) after reranking the top-128 retrieved candidates on v202603 (trained on v202602, evaluated on v202603). CoREB-Reranker is fine-tuned from Qwen3-Reranker-4B via LoRA and is available at hq-bench/coreb-code-reranker.
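To show where these deltas come from, below is a minimal sketch of the two-stage pipeline, under stated assumptions: the bi-encoder is any sentence-transformers model (the name in the usage comment is a generic stand-in, not one of the benchmarked models), and `score_pairs` is a hypothetical callable standing in for CoREB-Reranker or a Qwen3-Reranker checkpoint, which should be loaded and prompted as described in its own model card.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_then_rerank(query, corpus, encoder, score_pairs, k=128, top=10):
    """Stage 1: bi-encoder retrieval of the top-k candidates from `corpus`.
    Stage 2: rerank only those k candidates with `score_pairs` and keep `top`.

    Returns (stage1_top, reranked_top) so both orderings can be scored with
    nDCG@10; the table reports the change between the two.
    `score_pairs(query, docs) -> list[float]` is a placeholder for the actual
    reranker (e.g. CoREB-Reranker at hq-bench/coreb-code-reranker).
    """
    q = encoder.encode([query], normalize_embeddings=True)
    d = encoder.encode(corpus, normalize_embeddings=True)
    sims = (q @ d.T).ravel()                          # cosine similarity on normalized vectors
    cand = np.argsort(-sims)[:k]                      # stage-1 candidate indices, best first
    scores = np.asarray(score_pairs(query, [corpus[i] for i in cand]))
    reranked = cand[np.argsort(-scores)]              # stage-2 reordering of the same candidates
    return cand[:top].tolist(), reranked[:top].tolist()

# Usage sketch (model name is a generic stand-in, not a benchmarked model):
# encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# stage1, reranked = retrieve_then_rerank("reverse a linked list", corpus, encoder, my_scorer)
```

Because stage 2 only reshuffles the retriever's top-128, a negative Δ means the reranker demotes relevant candidates the bi-encoder had already surfaced, which is what the Jina Reranker v2 code-to-text number reflects.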
If you use CoREB in your research, please cite our paper:
```bibtex
@article{xue2026coreb,
  title   = {Beyond Retrieval: A Multitask Benchmark and Model for Code Search},
  author  = {Xue, Siqiao and Liao, Zihan and Qin, Jin and Zhang, Ziyin and Mu, Yixiang and Zhou, Fan and Yu, Hang},
  journal = {arXiv preprint arXiv:2605.04615},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.04615}
}
```