QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Siqiao Xue*‡, Zhaoyang Zhu*, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu

*Equal contribution   ‡Corresponding author   Work done at Alipay


About QuitoBench

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. Quito is a billion-scale CloudOps time series corpus (1.6B tokens) collected from Alipay’s production infrastructure. QuitoBench is the accompanying evaluation benchmark, with balanced coverage across eight trend × seasonality × forecastability (TSF) regimes: a difficulty-centric design that captures forecasting-relevant properties rather than application-defined domain labels.

Benchmarking 10 models spanning deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we distill the key insights summarized in the six analyses below.

Main Results: Overall Leaderboard

Evaluation protocol: Mean MAE and mean rank across 232,200 instances (3 context lengths × 3 horizons × 2 variate modes). Lower rank and MAE are better.
| Model | Category | Mean Rank (MV) | Mean Rank (UV) | Mean Rank (Overall) | Mean MAE (MV) | Mean MAE (UV) | Mean MAE (Overall) |
|---|---|---|---|---|---|---|---|
| CrossFormer | DL | 3.05 | 2.67 | 2.86 | 0.282 | 0.275 | 0.279 |
| Chronos-2 | FM | 3.21 | 3.51 | 3.36 | 0.310 | 0.317 | 0.314 |
| TimesFM-2.5 | FM | 4.21 | 4.21 | 4.21 | 0.319 | 0.319 | 0.319 |
| PatchTST | DL | 4.37 | 4.34 | 4.35 | 0.299 | 0.298 | 0.299 |
| TiRex | FM | 4.36 | 4.36 | 4.36 | 0.322 | 0.322 | 0.322 |
| iTransformer | DL | 4.56 | 4.78 | 4.67 | 0.299 | 0.302 | 0.301 |
| TSMixer | DL | 5.58 | 5.43 | 5.51 | 0.313 | 0.309 | 0.311 |
| DLinear | DL | 7.24 | 7.29 | 7.26 | 0.368 | 0.371 | 0.369 |
| ES | Baseline | 9.18 | 9.18 | 9.18 | 0.695 | 0.695 | 0.695 |
| SNaive | Baseline | 9.25 | 9.25 | 9.25 | 0.675 | 0.675 | 0.675 |

DL = Deep Learning, FM = Foundation Model; MV = multivariate and UV = univariate evaluation modes.
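The leaderboard's mean-rank aggregation can be sketched as follows. `mean_ranks` is an illustrative helper of ours (not QuitoBench's released code): it ranks models by MAE within each evaluation configuration, then averages ranks across configurations.

```python
from collections import defaultdict

def mean_ranks(results):
    """results: {config_id: {model: mae}} -> {model: mean rank}.

    Rank 1 = lowest MAE within a configuration; ranks are then
    averaged across configurations, mirroring the protocol above."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for maes in results.values():
        ordered = sorted(maes, key=maes.get)  # best (lowest MAE) first
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
            counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}
```

In the full benchmark the configurations are the 18 combinations of context length, horizon, and variate mode, evaluated over all instances.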

Analysis I: Scaling Laws for Data and Model Size

More data beats more parameters: CrossFormer’s MAE drops 66% (0.725→0.248) when scaling training data from 10K to 100M tokens, while TimesFM-2.5 improves 24% (0.849→0.647). Model scaling shows diminishing returns—both architectures plateau beyond 1M parameters.
| Scaling Dimension | Scale | CrossFormer MAE | TimesFM-2.5 MAE |
|---|---|---|---|
| Data (tokens) | 10K | 0.725 | 0.849 |
| Data (tokens) | 1M | 0.456 | 0.735 |
| Data (tokens) | 100M | 0.248 | 0.647 |
| Model (params) | 10K | 0.602 | 0.821 |
| Model (params) | 1M | 0.456 | 0.735 |
| Model (params) | 100M | 0.456 | 0.735 |

Data scaling yields far larger gains than model scaling for both architectures. CrossFormer (1M params, DL) vs TimesFM-2.5 (200M params, FM).
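Under a power-law reading of the data-scaling rows (our assumption; the table reports raw points, not a fit), the per-decade improvement rate can be estimated with a log-log least-squares slope:

```python
import math

def loglog_slope(tokens, maes):
    """Least-squares slope of log10(MAE) vs log10(tokens): the
    per-decade improvement implied by a power law MAE ~ N^slope."""
    xs = [math.log10(t) for t in tokens]
    ys = [math.log10(m) for m in maes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# CrossFormer's data-scaling points from the table above
slope = loglog_slope([1e4, 1e6, 1e8], [0.725, 0.456, 0.248])
```

The fitted slope is roughly −0.12 per decade of tokens, and the 10K→100M endpoints reproduce the 66% headline drop.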

Analysis II: Context-Length Crossover

Context-length crossover: DL models win by 24.6% at short context (L=96), but FM models overtake by 14.8–22.0% at longer contexts (L≥576). FM models improve 43–50% from L=96 to L=1024, whereas DL models improve only 7–12%.
| Context Length (L) | FM MAE | DL MAE | Gap | Winner |
|---|---|---|---|---|
| 96 | 0.455 | 0.343 | −24.6% | Deep Learning |
| 576 | 0.256 | 0.293 | +14.8% | Foundation |
| 1024 | 0.245 | 0.299 | +22.0% | Foundation |

Gap is FM-relative; positive values favour foundation models.
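The sign convention in the Gap column can be made explicit with a one-line helper (the function name is ours, for illustration):

```python
def fm_relative_gap(dl_mae, fm_mae):
    """Gap = (DL - FM) / FM: negative values favour deep learning,
    positive values favour foundation models."""
    return (dl_mae - fm_mae) / fm_mae

# Endpoints of the crossover, from the table above
g96 = fm_relative_gap(0.343, 0.455)    # DL wins at short context
g1024 = fm_relative_gap(0.299, 0.245)  # FM wins at long context
```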

Analysis III: Forecast-Horizon Robustness

Horizon degradation: DLinear is the most horizon-robust model (+14.8% from H=48 to H=512), though at the cost of higher baseline MAE. Most models degrade 26–37%.
| Model | Category | H=48 | H=288 | H=512 | Δ(48→288) | Δ(48→512) |
|---|---|---|---|---|---|---|
| CrossFormer | DL | 0.237 | 0.283 | 0.317 | +19.3% | +33.9% |
| PatchTST | DL | 0.252 | 0.300 | 0.344 | +19.0% | +36.7% |
| iTransformer | DL | 0.260 | 0.306 | 0.335 | +17.6% | +28.5% |
| TSMixer | DL | 0.273 | 0.316 | 0.345 | +15.8% | +26.3% |
| DLinear | DL | 0.345 | 0.367 | 0.396 | +6.5% | +14.8% |
| Chronos-2 | FM | 0.262 | 0.321 | 0.358 | +22.8% | +37.0% |
| TimesFM-2.5 | FM | 0.271 | 0.329 | 0.358 | +21.3% | +32.1% |
| TiRex | FM | 0.276 | 0.331 | 0.361 | +19.9% | +30.7% |

Percentage degradation relative to H=48. DLinear shows smallest degradation at the cost of higher baseline MAE.
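The Δ columns are plain percentage increases relative to the H=48 baseline; for instance, DLinear's Δ(48→512) = (0.396 − 0.345)/0.345 ≈ +14.8%. As a sketch:

```python
def horizon_degradation(mae_short, mae_long):
    """Percentage MAE increase from the short to the long horizon."""
    return 100.0 * (mae_long - mae_short) / mae_short
```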

Analysis IV: TSF Regime — Specialization

Complementary specialization: At the category level, FM wins 6 of 8 cells; however, the best single DL model (CrossFormer) beats the best FM (Chronos-2) in 4 of 8 cells—primarily those without seasonality—with up to 38.4% advantage. Regime labels follow Trend–Seasonality–Forecastability convention (threshold 0.4).
| TSF Regime | Trend | Seasonality | Forecastability | CrossFormer MAE | Chronos-2 MAE | Winner |
|---|---|---|---|---|---|---|
| High–High–High | High | High | High | 0.165 | 0.163 | FM |
| High–High–Low | High | High | Low | 0.356 | 0.353 | FM |
| High–Low–High | High | Low | High | 0.180 | 0.349 | DL (+38.4%) |
| High–Low–Low | High | Low | Low | 0.600 | 0.628 | DL |
| Low–High–High | Low | High | High | 0.199 | 0.197 | FM |
| Low–High–Low | Low | High | Low | 0.239 | 0.235 | FM |
| Low–Low–High | Low | Low | High | 0.154 | 0.207 | DL (+17.7%) |
| Low–Low–Low | Low | Low | Low | 0.370 | 0.397 | DL |

CrossFormer = best DL model, Chronos-2 = best FM. Highlighted DL wins show percentage advantage.
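A sketch of how a series might be bucketed into a TSF regime. Trend and seasonal strengths follow the standard variance-based definitions (1 minus the remainder-to-component variance ratio); the forecastability proxy used here (1 − Var(remainder)/Var(series)) is our simplification for illustration and is not necessarily the measure QuitoBench applies, so only the 0.4 threshold and the regime naming are taken from the text above.

```python
import math
import statistics

def tsf_regime(x, period, threshold=0.4):
    """Label a series HIGH/LOW on trend, seasonality, and a
    forecastability proxy, using a simple moving-average decomposition."""
    half = period // 2
    n = len(x)
    # centred moving average over one full period as the trend estimate
    trend = [sum(x[t - half:t + half]) / (2 * half) for t in range(half, n - half)]
    xs = x[half:n - half]  # region where the trend estimate is valid
    detrended = [a - b for a, b in zip(xs, trend)]
    # seasonal component: mean of the detrended series at each phase
    phase_means = [statistics.mean(detrended[p::period]) for p in range(period)]
    seasonal = [phase_means[t % period] for t in range(len(xs))]
    remainder = [d - s for d, s in zip(detrended, seasonal)]
    deseasonalized = [a - s for a, s in zip(xs, seasonal)]
    v_rem = statistics.pvariance(remainder)
    f_trend = max(0.0, 1 - v_rem / statistics.pvariance(deseasonalized))
    f_seas = max(0.0, 1 - v_rem / statistics.pvariance(detrended))
    f_cast = max(0.0, 1 - v_rem / statistics.pvariance(xs))
    label = "_".join("HIGH" if v > threshold else "LOW"
                     for v in (f_trend, f_seas, f_cast))
    return label, (f_trend, f_seas, f_cast)

# Example: a trending, strongly seasonal, noise-free series
series = [0.05 * t + math.sin(2 * math.pi * t / 24) for t in range(240)]
label, strengths = tsf_regime(series, period=24)
```

A noise-free trend-plus-sine series lands in HIGH_HIGH_HIGH; adding noise lowers the forecastability score first, which is exactly the axis the analyses below show to dominate difficulty.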

Analysis IV: TSF Regime — Forecastability

Forecastability dominates: HIGH_LOW_LOW is 3.64× harder than the easiest group (HIGH_HIGH_HIGH), demonstrating forecastability's dominant effect on prediction difficulty. The gap dwarfs the trend (1.32×) and seasonality (1.51×) effects.
| Rank | Group | Trend | Seasonality | Forecastability | Mean MAE |
|---|---|---|---|---|---|
| 1 | HIGH_HIGH_HIGH | High | High | High | 0.205 |
| 2 | LOW_LOW_HIGH | Low | Low | High | 0.220 |
| 3 | LOW_HIGH_HIGH | Low | High | High | 0.299 |
| 4 | LOW_HIGH_LOW | Low | High | Low | 0.359 |
| 5 | HIGH_LOW_HIGH | High | Low | High | 0.376 |
| 6 | LOW_LOW_LOW | Low | Low | Low | 0.456 |
| 7 | HIGH_HIGH_LOW | High | High | Low | 0.478 |
| 8 | HIGH_LOW_LOW | High | Low | Low | 0.749 |

Ranked by mean MAE across all models. Rank 1 = easiest, Rank 8 = hardest.

Analysis IV: TSF Regime — Sensitivity

FM models are more robust: Deep learning models show higher sensitivity to forecastability (2.0–2.3×) than foundation models (1.7–1.8×), indicating greater robustness of pre-trained representations for unpredictable series.
| Model | Category | MAE (HIGH forecastability) | MAE (LOW forecastability) | Ratio (LOW/HIGH) |
|---|---|---|---|---|
| PatchTST | DL | 0.185 | 0.420 | 2.28× |
| iTransformer | DL | 0.186 | 0.422 | 2.26× |
| TSMixer | DL | 0.194 | 0.436 | 2.25× |
| CrossFormer | DL | 0.174 | 0.390 | 2.24× |
| DLinear | DL | 0.245 | 0.501 | 2.04× |
| Chronos-2 | FM | 0.230 | 0.403 | 1.75× |
| TimesFM-2.5 | FM | 0.235 | 0.409 | 1.74× |
| TiRex | FM | 0.237 | 0.413 | 1.74× |
| SNaive | Baseline | 0.517 | 0.843 | 1.63× |
| ES | Baseline | 0.550 | 0.850 | 1.55× |

Ratio = LOW / HIGH MAE. Higher ratio indicates greater sensitivity to forecastability.

Analysis IV: TSF Regime — Pathological HIGH_LOW_LOW

Hardest regime: Even the best model (CrossFormer, MAE 0.600) performs 3× worse than on easy series. Statistical baselines fail catastrophically (+77–82% gap from best).
| Rank | Model | MAE | Gap from Best |
|---|---|---|---|
| 1 | CrossFormer | 0.600 | — |
| 2 | Chronos-2 | 0.628 | +4.6% |
| 3 | TimesFM-2.5 | 0.633 | +5.5% |
| 4 | TiRex | 0.652 | +8.6% |
| 5 | iTransformer | 0.656 | +9.3% |
| 6 | PatchTST | 0.669 | +11.5% |
| 7 | TSMixer | 0.691 | +15.2% |
| 8 | DLinear | 0.805 | +34.3% |
| 9 | ES | 1.061 | +77.0% |
| 10 | SNaive | 1.091 | +81.9% |

HIGH_LOW_LOW = high trend, low seasonality, low forecastability. Gap is relative to CrossFormer (best).

Analysis V: Parameter Efficiency

59× fewer parameters: DL models (avg. 1.9M params) achieve statistically indistinguishable aggregate performance from FM models (avg. 110M params). CrossFormer at 1M params outranks every foundation model including Chronos-2 at 100M.
| Metric | Foundation Models | Deep Learning | ROI |
|---|---|---|---|
| Avg. Params (M) | 110 | 1.9 | 59× fewer |
| Best MAE | 0.3138 | 0.2789 | DL −11% |
| Mean MAE | 0.3185 | 0.3117 | DL −2% |
| MAE at L=96 | 0.4551 | 0.3432 | DL −25% |
| MAE at L≥576 | 0.2502 | 0.2960 | FM −15% |
| MAE / M Params | 0.0029 | 0.1676 | 57.9× |
| Rank / M Params | 0.0361 | 2.6505 | 73.3× |

FM avg. 110M params vs. DL avg. 1.9M params. Higher MAE/M and Rank/M values indicate better per-parameter efficiency.

Analysis VI: Ranking Robustness

Cross-metric consistency: MAE and MSE rankings are highly correlated (Spearman ρ = 0.733 aggregate, mean 0.847 per configuration). CrossFormer retains the top rank under both metrics.
Cross-benchmark consistency: QuitoBench vs. Timer Spearman ρ = 0.865 (p < 0.01, 95% CI: [0.52, 0.97]), with DL-only correlation rising to 0.891 and 7/8 regimes sharing the same best DL model.
| Model | Category | Timer Rank | Quito Rank | Change | Consistency Tier |
|---|---|---|---|---|---|
| TiRex | FM | 1 | 5 | −4 | Tier 3 (Inconsistent) |
| Chronos-2 | FM | 2 | 2 | 0 | Tier 1 (Highly Consistent) |
| TimesFM-2.5 | FM | 3 | 3 | 0 | Tier 1 (Highly Consistent) |
| CrossFormer | DL | 4 | 1 | +3 | Tier 3 (Inconsistent) |
| TSMixer | DL | 5 | 7 | −2 | Tier 2 (Moderately Consistent) |
| iTransformer | DL | 6 | 6 | 0 | Tier 1 (Highly Consistent) |
| PatchTST | DL | 7 | 4 | +3 | Tier 3 (Inconsistent) |
| DLinear | DL | 8 | 8 | 0 | Tier 1 (Highly Consistent) |
| SNaive | Baseline | 9 | 10 | −1 | Tier 2 (Moderately Consistent) |
| ES | Baseline | 10 | 9 | +1 | Tier 2 (Moderately Consistent) |

Cross-benchmark Spearman ρ = 0.865. Tier 1 = no rank change; Tier 2 = a change of 1–2 places; Tier 3 = a change of 3 or more places.
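Both the tier assignment and the rank correlation can be reproduced mechanically. `consistency_tier` encodes the rule implied by the table rows, and `spearman_rho` is the textbook no-ties formula; both are illustrative helpers of ours, not benchmark code.

```python
def consistency_tier(rank_change):
    """Tier rule implied by the table: 0 -> Tier 1,
    a shift of 1-2 places -> Tier 2, 3 or more -> Tier 3."""
    delta = abs(rank_change)
    if delta == 0:
        return 1
    return 2 if delta <= 2 else 3

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho for two untied rankings:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Note that the paper's ρ = 0.865 is computed over the full evaluation, not simply over the ten aggregate ranks shown in this table.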

Citation

If you use QuitoBench in your research, please cite our paper:

@article{xue2026quitobench,
  title     = {QuitoBench: A High-Quality Open Time Series
               Forecasting Benchmark},
  author    = {Xue, Siqiao and Zhu, Zhaoyang and Zhang, Wei and
               Cai, Rongyao and Wang, Rui and
               Mu, Yixiang and Zhou, Fan and Li, Jianguo and Di, Peng and Yu, Hang},
  journal   = {arXiv preprint arXiv:2603.26017},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.26017}
}
