QuitoBench: A High-Quality Open Time Series Forecasting Benchmark

Siqiao Xue*‡, Zhaoyang Zhu*, Wei Zhang, Rongyao Cai, Rui Wang, Yixiang Mu, Fan Zhou, Jianguo Li, Peng Di, Hang Yu

*Equal contribution   ‡Corresponding author   Work done at Alipay


About QuitoBench

Time series forecasting is critical across finance, healthcare, and cloud computing, yet progress is constrained by a fundamental bottleneck: the scarcity of large-scale, high-quality benchmarks. Quito is a billion-scale CloudOps time series corpus (1.6B tokens) collected from Alipay’s production infrastructure. QuitoBench is the accompanying evaluation benchmark, with balanced coverage across eight trend × seasonality × forecastability (TSF) regimes: a difficulty-centric design that captures forecasting-relevant properties rather than application-defined domain labels.

Benchmarking 10 models spanning deep learning, foundation models, and statistical baselines across 232,200 evaluation instances, we distill the key insights summarized in the six analyses below.

Main Results: Overall Leaderboard

Evaluation protocol: Mean MAE and mean rank across 232,200 instances (3 context lengths × 3 horizons × 2 variate modes). Lower rank and MAE are better.
| Model | Category | Mean Rank (MV) | Mean Rank (UV) | Mean Rank (Overall) | Mean MAE (MV) | Mean MAE (UV) | Mean MAE (Overall) |
|---|---|---|---|---|---|---|---|
| CrossFormer | DL | 3.05 | 2.67 | 2.86 | 0.282 | 0.275 | 0.279 |
| Chronos-2 | FM | 3.21 | 3.51 | 3.36 | 0.310 | 0.317 | 0.314 |
| TimesFM-2.5 | FM | 4.21 | 4.21 | 4.21 | 0.319 | 0.319 | 0.319 |
| PatchTST | DL | 4.37 | 4.34 | 4.35 | 0.299 | 0.298 | 0.299 |
| TiRex | FM | 4.36 | 4.36 | 4.36 | 0.322 | 0.322 | 0.322 |
| iTransformer | DL | 4.56 | 4.78 | 4.67 | 0.299 | 0.302 | 0.301 |
| TSMixer | DL | 5.58 | 5.43 | 5.51 | 0.313 | 0.309 | 0.311 |
| DLinear | DL | 7.24 | 7.29 | 7.26 | 0.368 | 0.371 | 0.369 |
| ES | Baseline | 9.18 | 9.18 | 9.18 | 0.695 | 0.695 | 0.695 |
| SNaive | Baseline | 9.25 | 9.25 | 9.25 | 0.675 | 0.675 | 0.675 |

DL = Deep Learning, FM = Foundation Model; MV = multivariate and UV = univariate evaluation modes.
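The leaderboard's mean-rank aggregation can be sketched as follows. `mean_ranks` is an illustrative helper of ours (not QuitoBench's released code): it ranks models by MAE within each evaluation configuration, then averages ranks across configurations.

```python
from collections import defaultdict

def mean_ranks(results):
    """results: {config_id: {model: mae}} -> {model: mean rank}.

    Rank 1 = lowest MAE within a configuration; ranks are then
    averaged across configurations, mirroring the protocol above."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for maes in results.values():
        ordered = sorted(maes, key=maes.get)  # best (lowest MAE) first
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
            counts[model] += 1
    return {m: totals[m] / counts[m] for m in totals}
```

In the full benchmark the configurations are the 18 combinations of context length, horizon, and variate mode, evaluated over all instances.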

Analysis I: Scaling Laws for Data and Model Size

More data beats more parameters: CrossFormer’s MAE drops 66% (0.725→0.248) when scaling training data from 10K to 100M tokens, while TimesFM-2.5 improves 24% (0.849→0.647). Model scaling shows diminishing returns—both architectures plateau beyond 1M parameters.
| Scaling Dimension | Scale | CrossFormer MAE | TimesFM-2.5 MAE |
|---|---|---|---|
| Data (tokens) | 10K | 0.725 | 0.849 |
| Data (tokens) | 1M | 0.456 | 0.735 |
| Data (tokens) | 100M | 0.248 | 0.647 |
| Model (params) | 10K | 0.602 | 0.821 |
| Model (params) | 1M | 0.456 | 0.735 |
| Model (params) | 100M | 0.456 | 0.735 |

Data scaling yields far larger gains than model scaling for both architectures. CrossFormer (1M params, DL) vs TimesFM-2.5 (200M params, FM).
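Under a power-law reading of the data-scaling rows (our assumption; the table reports raw points, not a fit), the per-decade improvement rate can be estimated with a log-log least-squares slope:

```python
import math

def loglog_slope(tokens, maes):
    """Least-squares slope of log10(MAE) vs log10(tokens): the
    per-decade improvement implied by a power law MAE ~ N^slope."""
    xs = [math.log10(t) for t in tokens]
    ys = [math.log10(m) for m in maes]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# CrossFormer's data-scaling points from the table above
slope = loglog_slope([1e4, 1e6, 1e8], [0.725, 0.456, 0.248])
```

The fitted slope is roughly −0.12 per decade of tokens, and the 10K→100M endpoints reproduce the 66% headline drop.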

Analysis II: Context-Length Crossover

Context-length crossover: DL models win by 24.6% at short context (L=96), but FM models overtake by 14.8–22.0% at longer contexts (L≥576). FM models improve 43–50% from L=96 to L=1024, whereas DL models improve only 7–12%.
| Context Length (L) | FM MAE | DL MAE | Gap | Winner |
|---|---|---|---|---|
| 96 | 0.455 | 0.343 | −24.6% | Deep Learning |
| 576 | 0.256 | 0.293 | +14.8% | Foundation |
| 1024 | 0.245 | 0.299 | +22.0% | Foundation |

Gap is FM-relative; positive values favour foundation models.
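The sign convention in the Gap column can be made explicit with a one-line helper (the function name is ours, for illustration):

```python
def fm_relative_gap(dl_mae, fm_mae):
    """Gap = (DL - FM) / FM: negative values favour deep learning,
    positive values favour foundation models."""
    return (dl_mae - fm_mae) / fm_mae

# Endpoints of the crossover, from the table above
g96 = fm_relative_gap(0.343, 0.455)    # DL wins at short context
g1024 = fm_relative_gap(0.299, 0.245)  # FM wins at long context
```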

Analysis III: Forecast-Horizon Robustness

Horizon degradation: DLinear is the most horizon-robust model (+14.8% from H=48 to H=512), though at the cost of higher baseline MAE. Most models degrade 26–37%.
| Model | Category | H=48 | H=288 | H=512 | Δ(48→288) | Δ(48→512) |
|---|---|---|---|---|---|---|
| CrossFormer | DL | 0.237 | 0.283 | 0.317 | +19.3% | +33.9% |
| PatchTST | DL | 0.252 | 0.300 | 0.344 | +19.0% | +36.7% |
| iTransformer | DL | 0.260 | 0.306 | 0.335 | +17.6% | +28.5% |
| TSMixer | DL | 0.273 | 0.316 | 0.345 | +15.8% | +26.3% |
| DLinear | DL | 0.345 | 0.367 | 0.396 | +6.5% | +14.8% |
| Chronos-2 | FM | 0.262 | 0.321 | 0.358 | +22.8% | +37.0% |
| TimesFM-2.5 | FM | 0.271 | 0.329 | 0.358 | +21.3% | +32.1% |
| TiRex | FM | 0.276 | 0.331 | 0.361 | +19.9% | +30.7% |

Percentage degradation relative to H=48. DLinear shows smallest degradation at the cost of higher baseline MAE.
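The Δ columns are plain percentage increases relative to the H=48 baseline; for instance, DLinear's Δ(48→512) = (0.396 − 0.345)/0.345 ≈ +14.8%. As a sketch:

```python
def horizon_degradation(mae_short, mae_long):
    """Percentage MAE increase from the short to the long horizon."""
    return 100.0 * (mae_long - mae_short) / mae_short
```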

Analysis IV: TSF Regime — Specialization

Complementary specialization: At the category level, FM wins 6 of 8 cells; however, the best single DL model (CrossFormer) beats the best FM (Chronos-2) in 4 of 8 cells—primarily those without seasonality—with up to 38.4% advantage. Regime labels follow Trend–Seasonality–Forecastability convention (threshold 0.4).
| TSF Regime | Trend | Seasonality | Forecastability | CrossFormer MAE | Chronos-2 MAE | Winner |
|---|---|---|---|---|---|---|
| High–High–High | High | High | High | 0.165 | 0.163 | FM |
| High–High–Low | High | High | Low | 0.356 | 0.353 | FM |
| High–Low–High | High | Low | High | 0.180 | 0.349 | DL (+38.4%) |
| High–Low–Low | High | Low | Low | 0.600 | 0.628 | DL |
| Low–High–High | Low | High | High | 0.199 | 0.197 | FM |
| Low–High–Low | Low | High | Low | 0.239 | 0.235 | FM |
| Low–Low–High | Low | Low | High | 0.154 | 0.207 | DL (+17.7%) |
| Low–Low–Low | Low | Low | Low | 0.370 | 0.397 | DL |

CrossFormer = best DL model, Chronos-2 = best FM. Highlighted DL wins show percentage advantage.
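A sketch of how a series might be bucketed into a TSF regime. Trend and seasonal strengths follow the standard variance-based definitions (1 minus the remainder-to-component variance ratio); the forecastability proxy used here (1 − Var(remainder)/Var(series)) is our simplification for illustration and is not necessarily the measure QuitoBench applies, so only the 0.4 threshold and the regime naming are taken from the text above.

```python
import math
import statistics

def tsf_regime(x, period, threshold=0.4):
    """Label a series HIGH/LOW on trend, seasonality, and a
    forecastability proxy, using a simple moving-average decomposition."""
    half = period // 2
    n = len(x)
    # centred moving average over one full period as the trend estimate
    trend = [sum(x[t - half:t + half]) / (2 * half) for t in range(half, n - half)]
    xs = x[half:n - half]  # region where the trend estimate is valid
    detrended = [a - b for a, b in zip(xs, trend)]
    # seasonal component: mean of the detrended series at each phase
    phase_means = [statistics.mean(detrended[p::period]) for p in range(period)]
    seasonal = [phase_means[t % period] for t in range(len(xs))]
    remainder = [d - s for d, s in zip(detrended, seasonal)]
    deseasonalized = [a - s for a, s in zip(xs, seasonal)]
    v_rem = statistics.pvariance(remainder)
    f_trend = max(0.0, 1 - v_rem / statistics.pvariance(deseasonalized))
    f_seas = max(0.0, 1 - v_rem / statistics.pvariance(detrended))
    f_cast = max(0.0, 1 - v_rem / statistics.pvariance(xs))
    label = "_".join("HIGH" if v > threshold else "LOW"
                     for v in (f_trend, f_seas, f_cast))
    return label, (f_trend, f_seas, f_cast)

# Example: a trending, strongly seasonal, noise-free series
series = [0.05 * t + math.sin(2 * math.pi * t / 24) for t in range(240)]
label, strengths = tsf_regime(series, period=24)
```

A noise-free trend-plus-sine series lands in HIGH_HIGH_HIGH; adding noise lowers the forecastability score first, which is exactly the axis the analyses below show to dominate difficulty.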

Analysis IV: TSF Regime — Forecastability

Forecastability dominates: HIGH_LOW_LOW is 3.64× harder than the easiest group (HIGH_HIGH_HIGH), demonstrating forecastability's dominant effect on prediction difficulty. The gap dwarfs the trend (1.32×) and seasonality (1.51×) effects.
| Rank | Group | Trend | Seasonality | Forecastability | Mean MAE |
|---|---|---|---|---|---|
| 1 | HIGH_HIGH_HIGH | High | High | High | 0.205 |
| 2 | LOW_LOW_HIGH | Low | Low | High | 0.220 |
| 3 | LOW_HIGH_HIGH | Low | High | High | 0.299 |
| 4 | LOW_HIGH_LOW | Low | High | Low | 0.359 |
| 5 | HIGH_LOW_HIGH | High | Low | High | 0.376 |
| 6 | LOW_LOW_LOW | Low | Low | Low | 0.456 |
| 7 | HIGH_HIGH_LOW | High | High | Low | 0.478 |
| 8 | HIGH_LOW_LOW | High | Low | Low | 0.749 |

Ranked by mean MAE across all models. Rank 1 = easiest, Rank 8 = hardest.

Analysis IV: TSF Regime — Sensitivity

FM models are more robust: Deep learning models show higher sensitivity to forecastability (2.0–2.3×) than foundation models (1.7–1.8×), indicating greater robustness of pre-trained representations for unpredictable series.
| Model | Category | MAE (HIGH forecastability) | MAE (LOW forecastability) | Ratio (LOW/HIGH) |
|---|---|---|---|---|
| PatchTST | DL | 0.185 | 0.420 | 2.28× |
| iTransformer | DL | 0.186 | 0.422 | 2.26× |
| TSMixer | DL | 0.194 | 0.436 | 2.25× |
| CrossFormer | DL | 0.174 | 0.390 | 2.24× |
| DLinear | DL | 0.245 | 0.501 | 2.04× |
| Chronos-2 | FM | 0.230 | 0.403 | 1.75× |
| TimesFM-2.5 | FM | 0.235 | 0.409 | 1.74× |
| TiRex | FM | 0.237 | 0.413 | 1.74× |
| SNaive | Baseline | 0.517 | 0.843 | 1.63× |
| ES | Baseline | 0.550 | 0.850 | 1.55× |

Ratio = LOW / HIGH MAE. Higher ratio indicates greater sensitivity to forecastability.

Analysis IV: TSF Regime — Pathological HIGH_LOW_LOW

Hardest regime: Even the best model (CrossFormer, MAE 0.600) performs 3× worse than on easy series. Statistical baselines fail catastrophically (+77–82% gap from best).
| Rank | Model | MAE | Gap from Best |
|---|---|---|---|
| 1 | CrossFormer | 0.600 | — |
| 2 | Chronos-2 | 0.628 | +4.6% |
| 3 | TimesFM-2.5 | 0.633 | +5.5% |
| 4 | TiRex | 0.652 | +8.6% |
| 5 | iTransformer | 0.656 | +9.3% |
| 6 | PatchTST | 0.669 | +11.5% |
| 7 | TSMixer | 0.691 | +15.2% |
| 8 | DLinear | 0.805 | +34.3% |
| 9 | ES | 1.061 | +77.0% |
| 10 | SNaive | 1.091 | +81.9% |

HIGH_LOW_LOW = high trend, low seasonality, low forecastability. Gap is relative to CrossFormer (best).

Analysis V: Parameter Efficiency

59× fewer parameters: DL models (avg. 1.9M params) achieve statistically indistinguishable aggregate performance from FM models (avg. 110M params). CrossFormer at 1M params outranks every foundation model including Chronos-2 at 100M.
| Metric | Foundation Models | Deep Learning | ROI |
|---|---|---|---|
| Avg. Params (M) | 110 | 1.9 | 59× fewer |
| Best MAE | 0.3138 | 0.2789 | DL −11% |
| Mean MAE | 0.3185 | 0.3117 | DL −2% |
| MAE at L=96 | 0.4551 | 0.3432 | DL −25% |
| MAE at L≥576 | 0.2502 | 0.2960 | FM −15% |
| MAE / M Params | 0.0029 | 0.1676 | 57.9× |
| Rank / M Params | 0.0361 | 2.6505 | 73.3× |

FM avg. 110M params vs. DL avg. 1.9M params. Higher MAE/M and Rank/M values indicate better per-parameter efficiency.

Analysis VI: Ranking Robustness

Cross-metric consistency: MAE and MSE rankings are highly correlated (Spearman ρ = 0.733 aggregate, mean 0.847 per configuration). CrossFormer retains the top rank under both metrics.
Cross-benchmark consistency: QuitoBench vs. Timer Spearman ρ = 0.865 (p < 0.01, 95% CI: [0.52, 0.97]), with DL-only correlation rising to 0.891 and 7/8 regimes sharing the same best DL model.
| Model | Category | Timer Rank | Quito Rank | Change | Consistency Tier |
|---|---|---|---|---|---|
| TiRex | FM | 1 | 5 | −4 | Tier 3 (Inconsistent) |
| Chronos-2 | FM | 2 | 2 | 0 | Tier 1 (Highly Consistent) |
| TimesFM-2.5 | FM | 3 | 3 | 0 | Tier 1 (Highly Consistent) |
| CrossFormer | DL | 4 | 1 | +3 | Tier 3 (Inconsistent) |
| TSMixer | DL | 5 | 7 | −2 | Tier 2 (Moderately Consistent) |
| iTransformer | DL | 6 | 6 | 0 | Tier 1 (Highly Consistent) |
| PatchTST | DL | 7 | 4 | +3 | Tier 3 (Inconsistent) |
| DLinear | DL | 8 | 8 | 0 | Tier 1 (Highly Consistent) |
| SNaive | Baseline | 9 | 10 | −1 | Tier 2 (Moderately Consistent) |
| ES | Baseline | 10 | 9 | +1 | Tier 2 (Moderately Consistent) |

Cross-benchmark Spearman ρ = 0.865. Tier 1 = no rank change; Tier 2 = a change of 1–2 places; Tier 3 = a change of 3 or more places.
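Both the tier assignment and the rank correlation can be reproduced mechanically. `consistency_tier` encodes the rule implied by the table rows, and `spearman_rho` is the textbook no-ties formula; both are illustrative helpers of ours, not benchmark code.

```python
def consistency_tier(rank_change):
    """Tier rule implied by the table: 0 -> Tier 1,
    a shift of 1-2 places -> Tier 2, 3 or more -> Tier 3."""
    delta = abs(rank_change)
    if delta == 0:
        return 1
    return 2 if delta <= 2 else 3

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rho for two untied rankings:
    rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Note that the paper's ρ = 0.865 is computed over the full evaluation, not simply over the ten aggregate ranks shown in this table.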

Citation

If you use QuitoBench in your research, please cite our paper:

@article{xue2026quitobench,
  title     = {QuitoBench: A High-Quality Open Time Series
               Forecasting Benchmark},
  author    = {Xue, Siqiao and Zhu, Zhaoyang and Zhang, Wei and
               Cai, Rongyao and Wang, Rui and
               Mu, Yixiang and Zhou, Fan and Li, Jianguo and Di, Peng and Yu, Hang},
  journal   = {arXiv preprint arXiv:2603.26017},
  year      = {2026},
  url       = {https://arxiv.org/abs/2603.26017}
}
