Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. Quito is a billion-scale CloudOps time series corpus (1.6B tokens) collected from Alipay's production infrastructure, and QuitoBench is its evaluation benchmark, with balanced coverage across eight trend × seasonality × forecastability (TSF) regimes: a difficulty-centric design that captures forecasting-relevant properties rather than application-defined domain labels.
Benchmarking 10 models spanning deep learning (DL), foundation models (FM), and statistical baselines across 232,200 evaluation instances, we distill four key insights:
| Model | Category | Mean Rank (MV) | Mean Rank (UV) | Mean Rank (Overall) | Mean MAE (MV) | Mean MAE (UV) | Mean MAE (Overall) |
|---|---|---|---|---|---|---|---|
| CrossFormer | DL | 3.05 | 2.67 | 2.86 | 0.282 | 0.275 | 0.279 |
| Chronos-2 | FM | 3.21 | 3.51 | 3.36 | 0.310 | 0.317 | 0.314 |
| TimesFM-2.5 | FM | 4.21 | 4.21 | 4.21 | 0.319 | 0.319 | 0.319 |
| PatchTST | DL | 4.37 | 4.34 | 4.35 | 0.299 | 0.298 | 0.299 |
| TiRex | FM | 4.36 | 4.36 | 4.36 | 0.322 | 0.322 | 0.322 |
| iTransformer | DL | 4.56 | 4.78 | 4.67 | 0.299 | 0.302 | 0.301 |
| TSMixer | DL | 5.58 | 5.43 | 5.51 | 0.313 | 0.309 | 0.311 |
| DLinear | DL | 7.24 | 7.29 | 7.26 | 0.368 | 0.371 | 0.369 |
| ES | BASELINE | 9.18 | 9.18 | 9.18 | 0.695 | 0.695 | 0.695 |
| SNaive | BASELINE | 9.25 | 9.25 | 9.25 | 0.675 | 0.675 | 0.675 |
DL = Deep Learning, FM = Foundation Model; MV = multivariate, UV = univariate.
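For reference, the mean-rank aggregation behind leaderboards of this kind can be sketched as follows (a minimal illustration with hypothetical per-instance scores, not necessarily the paper's exact pipeline): models are ranked by MAE within each evaluation instance, then ranks and MAEs are averaged per model.

```python
from collections import defaultdict

# Hypothetical per-instance MAE scores: {instance_id: {model: mae}}.
scores = {
    0: {"A": 0.20, "B": 0.25, "C": 0.30},
    1: {"A": 0.21, "B": 0.28, "C": 0.27},
}

ranks, maes = defaultdict(list), defaultdict(list)
for per_model in scores.values():
    # Rank models within this instance: 1 = lowest MAE.
    for r, m in enumerate(sorted(per_model, key=per_model.get), start=1):
        ranks[m].append(r)
        maes[m].append(per_model[m])

# Mean rank and mean MAE per model, as in the leaderboard columns.
leaderboard = {m: (sum(ranks[m]) / len(ranks[m]), sum(maes[m]) / len(maes[m]))
               for m in ranks}
# leaderboard["A"] -> (1.0, ~0.205)
```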
| Scaling Dimension | Scale | CrossFormer MAE | TimesFM-2.5 MAE |
|---|---|---|---|
| Data (tokens) | 10K | 0.725 | 0.849 |
| Data (tokens) | 1M | 0.456 | 0.735 |
| Data (tokens) | 100M | 0.248 | 0.647 |
| Model (params) | 10K | 0.602 | 0.821 |
| Model (params) | 1M | 0.456 | 0.735 |
| Model (params) | 100M | 0.456 | 0.735 |
Data scaling yields far larger gains than model scaling for both architectures. For reference, CrossFormer is a ~1M-parameter deep learning model and TimesFM-2.5 a ~200M-parameter foundation model.
| Context Length (L) | FM MAE | DL MAE | Gap | Winner |
|---|---|---|---|---|
| 96 | 0.455 | 0.343 | −24.6% | Deep Learning |
| 576 | 0.256 | 0.293 | +14.8% | Foundation |
| 1024 | 0.245 | 0.299 | +22.0% | Foundation |
Gap = (DL MAE − FM MAE) / FM MAE; positive values favour foundation models.
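The Gap column follows directly from the two MAE columns. A quick check (note the tabulated MAEs are rounded to three decimals, so gaps recomputed from them can differ from the printed ones in the last digit):

```python
def fm_relative_gap(fm_mae: float, dl_mae: float) -> float:
    """FM-relative gap in percent; positive values favour foundation models."""
    return (dl_mae - fm_mae) / fm_mae * 100

print(round(fm_relative_gap(0.455, 0.343), 1))  # L=96   -> -24.6
print(round(fm_relative_gap(0.245, 0.299), 1))  # L=1024 -> 22.0
```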
| Model | Category | H=48 | H=288 | H=512 | Δ(48→288) | Δ(48→512) |
|---|---|---|---|---|---|---|
| CrossFormer | DL | 0.237 | 0.283 | 0.317 | +19.3% | +33.9% |
| PatchTST | DL | 0.252 | 0.300 | 0.344 | +19.0% | +36.7% |
| iTransformer | DL | 0.260 | 0.306 | 0.335 | +17.6% | +28.5% |
| TSMixer | DL | 0.273 | 0.316 | 0.345 | +15.8% | +26.3% |
| DLinear | DL | 0.345 | 0.367 | 0.396 | +6.5% | +14.8% |
| Chronos-2 | FM | 0.262 | 0.321 | 0.358 | +22.8% | +37.0% |
| TimesFM-2.5 | FM | 0.271 | 0.329 | 0.358 | +21.3% | +32.1% |
| TiRex | FM | 0.276 | 0.331 | 0.361 | +19.9% | +30.7% |
Percentage degradation is relative to H=48. DLinear shows the smallest degradation, at the cost of the highest baseline MAE among DL models.
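The Δ columns are plain relative degradation versus H=48. Since the table's MAEs are rounded, recomputing from the printed figures can shift the percentages slightly (e.g. +19.4% vs the table's +19.3% for CrossFormer, presumably computed from unrounded values):

```python
def degradation(mae_base: float, mae_h: float) -> float:
    """Relative MAE degradation in percent versus the H=48 baseline."""
    return (mae_h / mae_base - 1) * 100

print(round(degradation(0.237, 0.283), 1))  # CrossFormer, Δ(48→288) -> 19.4
print(round(degradation(0.237, 0.317), 1))  # CrossFormer, Δ(48→512) -> 33.8
```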
| TSF Regime | Trend | Seasonality | Forecastability | CrossFormer MAE | Chronos-2 MAE | Winner |
|---|---|---|---|---|---|---|
| High–High–High | High | High | High | 0.165 | 0.163 | FM |
| High–High–Low | High | High | Low | 0.356 | 0.353 | FM |
| High–Low–High | High | Low | High | 0.180 | 0.349 | DL (+38.4%) |
| High–Low–Low | High | Low | Low | 0.600 | 0.628 | DL |
| Low–High–High | Low | High | High | 0.199 | 0.197 | FM |
| Low–High–Low | Low | High | Low | 0.239 | 0.235 | FM |
| Low–Low–High | Low | Low | High | 0.154 | 0.207 | DL (+17.7%) |
| Low–Low–Low | Low | Low | Low | 0.370 | 0.397 | DL |
CrossFormer is the best DL model and Chronos-2 the best FM. For DL wins, the parenthesised value is the percentage advantage over the FM.
| Rank | Group | Trend | Seasonality | Forecastability | Mean MAE |
|---|---|---|---|---|---|
| 1 | HIGH_HIGH_HIGH | High | High | High | 0.205 |
| 2 | LOW_LOW_HIGH | Low | Low | High | 0.220 |
| 3 | LOW_HIGH_HIGH | Low | High | High | 0.299 |
| 4 | LOW_HIGH_LOW | Low | High | Low | 0.359 |
| 5 | HIGH_LOW_HIGH | High | Low | High | 0.376 |
| 6 | LOW_LOW_LOW | Low | Low | Low | 0.456 |
| 7 | HIGH_HIGH_LOW | High | High | Low | 0.478 |
| 8 | HIGH_LOW_LOW | High | Low | Low | 0.749 |
Ranked by mean MAE across all models. Rank 1 = easiest, Rank 8 = hardest.
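The paper's exact definitions of the three regime axes are not reproduced here, but a common proxy (an assumption on our part, not necessarily QuitoBench's procedure) scores trend and seasonal strength from an additive decomposition and forecastability as one minus normalised spectral entropy:

```python
import numpy as np

def decompose(y, period):
    """Crude additive decomposition: moving-average trend,
    period-mean seasonal component, remainder residual."""
    trend = np.convolve(y, np.ones(period) / period, mode="same")
    detrended = y - trend
    pattern = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.resize(pattern, len(y))  # tile pattern to series length
    resid = detrended - seasonal
    return trend, seasonal, resid

def strength(resid, component):
    """Strength in [0, 1]: share of variance explained by the component."""
    denom = np.var(resid + component)
    return 0.0 if denom == 0 else max(0.0, 1.0 - np.var(resid) / denom)

def forecastability(y):
    """1 - normalised spectral entropy: higher = more predictable."""
    power = np.abs(np.fft.rfft(y - y.mean()))[1:] ** 2  # drop DC bin
    p = power / power.sum()
    p = p[p > 0]
    if len(p) < 2:
        return 1.0  # all power in one frequency: perfectly regular
    return 1.0 + (p * np.log(p)).sum() / np.log(len(p))
```

Each score lies in [0, 1]; thresholding them (e.g. at the corpus median) would yield the eight High/Low regime labels.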
| Model | HIGH Forecast MAE | LOW Forecast MAE | Ratio |
|---|---|---|---|
| PatchTST | 0.185 | 0.420 | 2.28× |
| iTransformer | 0.186 | 0.422 | 2.26× |
| TSMixer | 0.194 | 0.436 | 2.25× |
| CrossFormer | 0.174 | 0.390 | 2.24× |
| DLinear | 0.245 | 0.501 | 2.04× |
| Chronos-2 | 0.230 | 0.403 | 1.75× |
| TimesFM-2.5 | 0.235 | 0.409 | 1.74× |
| TiRex | 0.237 | 0.413 | 1.74× |
| SNaive | 0.517 | 0.843 | 1.63× |
| ES | 0.550 | 0.850 | 1.55× |
Ratio = LOW / HIGH MAE. Higher ratio indicates greater sensitivity to forecastability.
| Rank | Model | MAE | Gap from Best |
|---|---|---|---|
| 1 | CrossFormer | 0.600 | — |
| 2 | Chronos-2 | 0.628 | +4.6% |
| 3 | TimesFM-2.5 | 0.633 | +5.5% |
| 4 | TiRex | 0.652 | +8.6% |
| 5 | iTransformer | 0.656 | +9.3% |
| 6 | PatchTST | 0.669 | +11.5% |
| 7 | TSMixer | 0.691 | +15.2% |
| 8 | DLinear | 0.805 | +34.3% |
| 9 | ES | 1.061 | +77.0% |
| 10 | SNaive | 1.091 | +81.9% |
HIGH_LOW_LOW = high trend, low seasonality, low forecastability. Gap is relative to CrossFormer (best).
| Metric | Foundation Models | Deep Learning | ROI |
|---|---|---|---|
| Avg. Params (M) | 110 | 1.9 | 59× fewer |
| Best MAE | 0.3138 | 0.2789 | DL −11% |
| Mean MAE | 0.3185 | 0.3117 | DL −2% |
| MAE at L=96 | 0.4551 | 0.3432 | DL −25% |
| MAE at L≥576 | 0.2502 | 0.2960 | FM −15% |
| MAE / M Params | 0.0029 | 0.1676 | 57.9× |
| Rank / M Params | 0.0361 | 2.6505 | 73.3× |
FM avg. ≈110M params vs. DL avg. ≈1.9M params. Because both families reach comparable MAE and rank with vastly different parameter counts, the higher per-parameter values (MAE/M, Rank/M) for DL reflect better per-parameter efficiency.
| Model | Category | Timer Rank | Quito Rank | Change | Consistency Tier |
|---|---|---|---|---|---|
| TiRex | FM | 1 | 5 | −4 | Tier 3 (Inconsistent) |
| Chronos-2 | FM | 2 | 2 | 0 | Tier 1 (Highly Consistent) |
| TimesFM-2.5 | FM | 3 | 3 | 0 | Tier 1 (Highly Consistent) |
| CrossFormer | DL | 4 | 1 | +3 | Tier 3 (Inconsistent) |
| TSMixer | DL | 5 | 7 | −2 | Tier 2 (Moderately Consistent) |
| iTransformer | DL | 6 | 6 | 0 | Tier 1 (Highly Consistent) |
| PatchTST | DL | 7 | 4 | +3 | Tier 3 (Inconsistent) |
| DLinear | DL | 8 | 8 | 0 | Tier 1 (Highly Consistent) |
| SNaive | BASELINE | 9 | 10 | −1 | Tier 2 (Moderately Consistent) |
| ES | BASELINE | 10 | 9 | +1 | Tier 2 (Moderately Consistent) |
Cross-benchmark Spearman ρ = 0.865. Tiers are assigned by absolute rank change: |Δ| ≤ 1 → Tier 1, |Δ| = 2 → Tier 2, |Δ| ≥ 3 → Tier 3.
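Spearman ρ over two rank lists can be sketched as below. Note that the reported ρ = 0.865 is presumably computed at a finer granularity than these ten coarse leaderboard ranks, so this sketch illustrates the formula rather than reproducing that figure.

```python
# Spearman rank correlation (no ties): rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)).
timer = {"TiRex": 1, "Chronos-2": 2, "TimesFM-2.5": 3, "CrossFormer": 4,
         "TSMixer": 5, "iTransformer": 6, "PatchTST": 7, "DLinear": 8,
         "SNaive": 9, "ES": 10}
quito = {"TiRex": 5, "Chronos-2": 2, "TimesFM-2.5": 3, "CrossFormer": 1,
         "TSMixer": 7, "iTransformer": 6, "PatchTST": 4, "DLinear": 8,
         "SNaive": 10, "ES": 9}

n = len(timer)
d2 = sum((timer[m] - quito[m]) ** 2 for m in timer)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(round(rho, 3))  # rho over the coarse table ranks
```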
If you use QuitoBench in your research, please cite our paper:
```bibtex
@article{xue2026quitobench,
  title   = {QuitoBench: A High-Quality Open Time Series Forecasting Benchmark},
  author  = {Xue, Siqiao and Zhu, Zhaoyang and Zhang, Wei and Cai, Rongyao and
             Wang, Rui and Mu, Yixiang and Zhou, Fan and Li, Jianguo and
             Di, Peng and Yu, Hang},
  journal = {arXiv preprint arXiv:2603.26017},
  year    = {2026},
  url     = {https://arxiv.org/abs/2603.26017}
}
```