A Multitask Benchmark and Model for Code Search
Code search now underpins not only developer-facing tools but also modern AI coding agents (e.g., SWE-agent, OpenHands, Cursor). Yet existing benchmarks evaluate only the embedding stage, ignoring the reranking stage and the developer-style queries that production pipelines actually use, and they additionally suffer from data contamination, label noise, and degenerate binary relevance.
CoREB is a contamination-limited, multitask Code Retrieval and Reranking Benchmark, paired with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. It is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark twelve embedding models and five rerankers across three tasks: text-to-code (T2C), code-to-text (C2T), and code-to-code (C2C).
Key findings:
Embedding results on the v202602 release (graded qrels, relevance_level=2). nDCG and Recall denote nDCG@10 and Recall@10; Overall is the query-count-weighted average across tasks. Hard negatives (rel=1) penalize nDCG but do not count toward Recall.
| Model | Category | T2C nDCG | T2C Recall | C2T nDCG | C2T Recall | C2C nDCG | C2C Recall | Overall nDCG | Overall Recall |
|---|---|---|---|---|---|---|---|---|---|
| C2LLM-7B | CODE | **0.435** | **0.765** | **0.822** | **0.838** | 0.661 | 0.998 | **0.639** | **0.815** |
| C2LLM-0.5B | CODE | 0.429 | 0.713 | 0.800 | 0.833 | 0.664 | 0.978 | 0.625 | 0.789 |
| GemEmb-2 | CLOSED | 0.420 | 0.755 | 0.764 | **0.838** | **0.709** | **1.000** | 0.607 | 0.811 |
| Jina-code-emb-1.5b | CODE | 0.405 | 0.713 | 0.767 | 0.827 | 0.686 | 0.976 | 0.600 | 0.786 |
| F2LLM-4B | GENERAL | 0.400 | 0.694 | 0.788 | 0.825 | 0.515 | 0.839 | 0.597 | 0.767 |
| Jina-code-emb-0.5b | CODE | 0.397 | 0.679 | 0.742 | 0.808 | 0.699 | 0.980 | 0.585 | 0.761 |
| Jina-emb-v4 | GENERAL | 0.398 | 0.676 | 0.756 | 0.805 | 0.546 | 0.895 | 0.583 | 0.753 |
| Qwen3-Emb-4B | GENERAL | 0.399 | 0.651 | 0.763 | 0.813 | 0.386 | 0.713 | 0.576 | 0.734 |
| F2LLM-1.7B | GENERAL | 0.377 | 0.625 | 0.761 | 0.819 | 0.408 | 0.668 | 0.567 | 0.722 |
| F2LLM-0.6B | GENERAL | 0.348 | 0.583 | 0.742 | 0.795 | 0.336 | 0.576 | 0.540 | 0.686 |
| Qwen3-Emb-8B | GENERAL | 0.341 | 0.551 | 0.726 | 0.786 | 0.299 | 0.535 | 0.526 | 0.664 |
| Qwen3-Emb-0.6B | GENERAL | 0.350 | 0.579 | 0.665 | 0.757 | 0.419 | 0.719 | 0.508 | 0.675 |
Bold marks the per-column best. C2LLM-7B is the strongest model overall. GemEmb-2, the only closed-source API in the table, leads on C2C nDCG and C2C Recall and ties the best C2T Recall.
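To make the qrels convention above concrete, here is a minimal scoring sketch. It is an illustration rather than the official evaluation code, and it assumes a common pytrec_eval-style convention: nDCG@10 uses the graded labels with exponential gain (2^rel − 1), so hard negatives enter the nDCG computation, while Recall@10 treats only documents with rel >= relevance_level (2 here) as relevant. All ids and labels in the toy example are made up.

```python
import math
from typing import Dict, List

def ndcg_at_k(ranked_ids: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """Graded nDCG@k with exponential gain (2^rel - 1) and log2 rank discount."""
    def dcg(rels: List[int]) -> float:
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

def recall_at_k(ranked_ids: List[str], qrels: Dict[str, int],
                k: int = 10, relevance_level: int = 2) -> float:
    """Recall@k where only rel >= relevance_level counts as relevant (rel=1 is ignored)."""
    relevant = {doc_id for doc_id, rel in qrels.items() if rel >= relevance_level}
    return len(relevant & set(ranked_ids[:k])) / len(relevant) if relevant else 0.0

# Toy query: one true positive (rel=2) and one hard negative (rel=1); ids are made up.
qrels = {"solution_py": 2, "near_miss_py": 1}
run = ["near_miss_py", "solution_py", "unrelated_py"]  # retrieved ranking, best first
print(ndcg_at_k(run, qrels))    # < 1.0: the hard negative outranks the true positive
print(recall_at_k(run, qrels))  # 1.0: the hard negative never counts toward Recall
```

The Overall column then weights the per-task averages by query count, as noted in the caption.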
v202603: Jina Reranker v2 degrades code-to-text nDCG@10 by 22.4%, and a 12-point spread separates the best and worst baseline reranker on code-to-code (+3.3% vs. −8.8%). Our fine-tuned CoREB-Reranker (Qwen3-Reranker-4B + LoRA, trained on 3.1M samples from v202602) is the only reranker that achieves consistent gains across all three tasks.
| Reranker | Δ T2C | Δ C2T | Δ C2C | Net-Positive? |
|---|---|---|---|---|
| Jina Reranker v2 | −8.3% | −22.4% | −8.8% | No |
| Jina Reranker v3 | −2.2% | −5.0% | −0.1% | No |
| Qwen3-Reranker-0.6B | −0.6% | −8.2% | −2.3% | No |
| Qwen3-Reranker-4B | −0.1% | −3.2% | +3.3% | No |
| CoREB-Reranker | +1.1% | +0.8% | +5.1% | Yes |
Δ nDCG@10 (%) after reranking the top-128 retrieved candidates on v202603 (trained on v202602, evaluated on v202603). CoREB-Reranker is fine-tuned from Qwen3-Reranker-4B via LoRA and is available at hq-bench/coreb-code-reranker.
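To show where these deltas come from, below is a minimal sketch of the two-stage pipeline, under stated assumptions: the bi-encoder is any sentence-transformers model (the name in the usage comment is a generic stand-in, not one of the benchmarked models), and `score_pairs` is a hypothetical callable standing in for CoREB-Reranker or a Qwen3-Reranker checkpoint, which should be loaded and prompted as described in its own model card.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_then_rerank(query, corpus, encoder, score_pairs, k=128, top=10):
    """Stage 1: bi-encoder retrieval of the top-k candidates from `corpus`.
    Stage 2: rerank only those k candidates with `score_pairs` and keep `top`.

    Returns (stage1_top, reranked_top) so both orderings can be scored with
    nDCG@10; the table reports the change between the two.
    `score_pairs(query, docs) -> list[float]` is a placeholder for the actual
    reranker (e.g. CoREB-Reranker at hq-bench/coreb-code-reranker).
    """
    q = encoder.encode([query], normalize_embeddings=True)
    d = encoder.encode(corpus, normalize_embeddings=True)
    sims = (q @ d.T).ravel()                          # cosine similarity on normalized vectors
    cand = np.argsort(-sims)[:k]                      # stage-1 candidate indices, best first
    scores = np.asarray(score_pairs(query, [corpus[i] for i in cand]))
    reranked = cand[np.argsort(-scores)]              # stage-2 reordering of the same candidates
    return cand[:top].tolist(), reranked[:top].tolist()

# Usage sketch (model name is a generic stand-in, not a benchmarked model):
# encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
# stage1, reranked = retrieve_then_rerank("reverse a linked list", corpus, encoder, my_scorer)
```

Because stage 2 only reshuffles the retriever's top-128, a negative Δ means the reranker demotes relevant candidates the bi-encoder had already surfaced, which is what the Jina Reranker v2 code-to-text number reflects.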
If you use CoREB in your research, please cite our paper:
```bibtex
@article{xue2026coreb,
  title   = {Beyond Retrieval: A Multitask Benchmark and Model for Code Search},
  author  = {Xue, Siqiao and Liao, Zihan and Qin, Jin and Zhang, Ziyin and Mu, Yixiang and Zhou, Fan and Yu, Hang},
  journal = {arXiv preprint arXiv:2605.04615},
  year    = {2026},
  url     = {https://arxiv.org/abs/2605.04615}
}
```