Forecast Evaluation Methodology#
Overview#
This document provides technical detail on the Bank of England’s enhanced forecast evaluation approach. The methodologies described here form the foundation of the forecast evaluation toolkit and are used to assess the Bank’s forecast performance across multiple dimensions, enabling continuous learning from forecast errors.
Source Documentation
This documentation is derived from the Bank of England’s Macro Technical Paper: Learning from forecast errors: the Bank’s enhanced approach to forecast evaluation
Evaluation Approaches#
The evaluation framework employs three complementary approaches:
Approach 1: Long-term Statistical Evaluation
Characterizes forecast performance over extended historical periods using standard statistical tests:
Accuracy metrics (RMSE, MAE)
Unbiasedness tests
Efficiency tests
Benchmarking against model alternatives
Approach 2: Benchmark Comparisons
Compares Bank forecasts against a rich set of benchmark models to control for economic conditions and shock effects:
AR(p) autoregressive models
Bayesian VAR models
COMPASS DSGE model
Random walk models
Approach 3: Targeted Analysis of Recent Errors
Interrogates specific recent forecast errors and their drivers through complementary techniques:
Distributional analysis of errors
Rolling-window fluctuation tests
Data revision analysis
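A rolling-window view of relative performance can be sketched as below. This is a minimal illustration, not the toolkit's implementation: the function name `rolling_loss_difference` and the use of squared-error loss are assumptions, and a formal fluctuation test would additionally standardise this path and compare it with critical values, which the sketch omits.

```python
import numpy as np

def rolling_loss_difference(errors_base, errors_alt, window):
    """Rolling mean of the squared-loss differential (alt minus base).

    Positive values mean the base forecast had lower squared loss
    within that window. Hypothetical helper for illustration only.
    """
    e_base = np.asarray(errors_base, dtype=float)
    e_alt = np.asarray(errors_alt, dtype=float)
    d = e_alt ** 2 - e_base ** 2
    kernel = np.ones(window) / window
    # mode="valid" keeps only windows that are fully inside the sample
    return np.convolve(d, kernel, mode="valid")

# Toy example: the alternative is worse early on and better later
path = rolling_loss_difference([1, 1, 1, 1], [2, 2, 0, 0], window=2)
```

A path that changes sign over time, as in the toy example, is the kind of instability that a full-sample average would hide.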
Forecast Error Definition and Data#
Forecast errors are computed as the difference between outturns and forecasts:
\[\varepsilon_{t|t-h} = y_t - \hat{y}_{t|t-h}\]
where:
\(y_t\) is the realized value (outturn)
\(\hat{y}_{t|t-h}\) is the forecast made \(h\) quarters ahead
\(h\) is the forecast horizon (0 to 12 quarters)
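As a minimal sketch of this definition, the errors for several horizons can be computed in one step. The array layout (one row per target quarter, one column per horizon) and the function name `forecast_errors` are assumptions made for illustration, not the toolkit's data model.

```python
import numpy as np

def forecast_errors(outturns, forecasts):
    """Return errors[t, h] = outturn[t] - forecast[t, h].

    outturns:  shape (T,), realised values y_t
    forecasts: shape (T, H), forecasts[t, h] is the forecast of y_t
               made h quarters ahead (assumed layout)
    """
    y = np.asarray(outturns, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    return y[:, None] - f  # broadcast the outturn across horizons

y = [2.0, 1.5, 3.0]
f = [[1.8, 2.5], [1.5, 1.0], [3.4, 2.0]]
errors = forecast_errors(y, f)
```

With this sign convention a positive error means the outturn came in above the forecast.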
Data Vintages
The toolkit accounts for data revisions by comparing forecasts against outturns published \(k\) quarters after initial release. By default, \(k=12\) is used, ensuring:
GDP data has been fully “balanced” at least twice in the ONS Blue Book
Sufficient time has elapsed for material revisions
Comparability with original forecast conditions
When the final vintage is unavailable, the latest published data is used.
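The vintage rule can be sketched as follows, assuming (hypothetically) that the published estimates for one target quarter are held in a list ordered by publication date, with position `j` holding the estimate published `j` quarters after the first release. The function name `select_outturn` is illustrative only.

```python
def select_outturn(vintages, k=12):
    """Pick the outturn published k quarters after the initial release.

    vintages: list of estimates for one target quarter, ordered by
              publication (vintages[0] is the first release).
    Falls back to the latest published estimate when vintage k
    has not been published yet.
    """
    if not vintages:
        raise ValueError("no published vintages for this period")
    return vintages[k] if len(vintages) > k else vintages[-1]

mature = select_outturn([0.1 * j for j in range(20)], k=12)  # vintage 12 exists
recent = select_outturn([0.4, 0.5, 0.6], k=12)               # falls back to latest
```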
Statistical Evaluation Metrics#
Accuracy Assessment#
Forecast accuracy is evaluated using two complementary metrics:
Root Mean Squared Error (RMSE)
\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\varepsilon_i^2}\]
where \(\varepsilon_i\) represents the \(i\)-th forecast error and \(N\) is the number of observations.
RMSE penalises larger errors more heavily, consistent with a quadratic loss function.
Mean Absolute Error (MAE)
\[\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\varepsilon_i\right|\]
MAE is less sensitive to outliers than RMSE and equals the average absolute size of the errors.
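Relative Accuracy: RMSE Ratio
\[\text{RMSE ratio} = \frac{\text{RMSE}_{\text{numerator}}}{\text{RMSE}_{\text{denominator}}}\]
A ratio greater than 1.0 indicates that the denominator forecast performs better, because its RMSE is the smaller of the two.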
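The accuracy metrics described above are simple to compute directly. The sketch below uses hypothetical function names (`rmse`, `mae`, `rmse_ratio`) and toy error series; it is not the toolkit's API.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error of a sequence of forecast errors."""
    e = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))

def mae(errors):
    """Mean absolute error of a sequence of forecast errors."""
    e = np.asarray(errors, dtype=float)
    return float(np.mean(np.abs(e)))

def rmse_ratio(numerator_errors, denominator_errors):
    """RMSE of the numerator forecast divided by that of the denominator."""
    return rmse(numerator_errors) / rmse(denominator_errors)

errors_a = [0.5, -0.5, 1.0, -1.0]
errors_b = [1.0, -1.0, 2.0, -2.0]
ratio = rmse_ratio(errors_a, errors_b)  # below 1: forecast A is more accurate
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are much larger than the rest.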
Diebold-Mariano Test
The Diebold-Mariano (DM) test evaluates whether differences in forecast accuracy between two models are statistically significant. The test assesses:
Null Hypothesis: Expected difference in forecast loss is zero (forecasts have equal accuracy)
Test Statistic: Computed using HAC variance estimators to account for autocorrelation at multi-step horizons
Interpretation: Positive values indicate the base model performs better; negative values indicate worse performance
The Harvey, Leybourne and Newbold (1997) correction is applied to account for small sample sizes.
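A minimal sketch of a DM statistic with the small-sample correction is given below. Several choices here are assumptions rather than the toolkit's settings: squared-error loss, a Bartlett-weighted HAC variance with `h - 1` lags, `h` counted as steps ahead starting from 1, and a loss differential defined as alternative minus base so that positive values favour the base model, in line with the interpretation above.

```python
import numpy as np

def diebold_mariano(errors_base, errors_alt, h=1):
    """DM statistic with the Harvey-Leybourne-Newbold (1997) scaling.

    h is the number of steps ahead (h >= 1, an assumption of this sketch).
    Returns a statistic usually compared with a Student-t(n - 1) distribution.
    """
    e1 = np.asarray(errors_base, dtype=float)
    e2 = np.asarray(errors_alt, dtype=float)
    d = e2 ** 2 - e1 ** 2            # positive when the base forecast has lower loss
    n = d.size
    d_bar = d.mean()
    dc = d - d_bar

    # Bartlett-weighted long-run variance of the loss differential
    lags = max(h - 1, 0)
    lrv = np.dot(dc, dc) / n
    for k in range(1, lags + 1):
        gamma_k = np.dot(dc[k:], dc[:-k]) / n
        lrv += 2.0 * (1.0 - k / (lags + 1)) * gamma_k

    dm = d_bar / np.sqrt(lrv / n)
    # Small-sample scaling factor from Harvey, Leybourne and Newbold (1997)
    hln = np.sqrt((n + 1 - 2 * h + h * (h - 1) / n) / n)
    return float(dm * hln)

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, 80)   # synthetic errors, smaller on average
alt = rng.normal(0.0, 2.0, 80)    # synthetic errors, larger on average
stat_h1 = diebold_mariano(base, alt, h=1)
stat_h4 = diebold_mariano(base, alt, h=4)
```

The Bartlett weights keep the variance estimate non-negative, which is one way of handling the negative long-run variance problem that can arise with unweighted autocovariances in small samples.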
Unbiasedness Assessment#
A forecast is unbiased if it does not systematically over- or under-predict outcomes. Bias testing uses the following regression of forecast errors on a constant:
\[\varepsilon_{t|t-h} = \beta + u_t\]
where:
\(\varepsilon_{t|t-h}\) is the forecast error (outturn minus forecast)
\(\beta\) is the bias coefficient (its OLS estimate equals the sample mean of the errors)
\(u_t\) is the error term
Interpretation:
\(\beta > 0\): Forecasts systematically underestimate outcomes
\(\beta < 0\): Forecasts systematically overestimate outcomes
\(\beta \approx 0\): Forecasts are unbiased
Statistical significance is tested using a t-test with HAC standard errors, with lag order set to \(h\) (forecast horizon in quarters).
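The bias test can be sketched as a regression on a constant with a HAC standard error. The Bartlett (Newey-West) kernel used here is an assumption, since the text above specifies only the lag order, and `bias_test` is a hypothetical name.

```python
import numpy as np

def bias_test(errors, h):
    """Estimate the bias and its HAC t-statistic.

    errors: forecast errors (outturn minus forecast) at one horizon
    h:      HAC lag order, set to the forecast horizon in quarters
    """
    e = np.asarray(errors, dtype=float)
    n = e.size
    beta = e.mean()                  # OLS estimate of the constant
    u = e - beta

    # Bartlett-weighted long-run variance of the residuals (assumed kernel)
    lrv = np.dot(u, u) / n
    for k in range(1, min(h, n - 1) + 1):
        weight = 1.0 - k / (h + 1)
        lrv += 2.0 * weight * np.dot(u[k:], u[:-k]) / n

    se = np.sqrt(lrv / n)
    return float(beta), float(beta / se)

errors = [0.4, 0.1, 0.6, -0.2, 0.5, 0.3, 0.7, 0.0, 0.2, 0.4]
beta_hat, t_stat = bias_test(errors, h=1)
```

A positive `beta_hat` with a large t-statistic would point to systematic under-prediction under the outturn-minus-forecast convention.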
Efficiency Assessment#
Efficient forecasts utilize all available information optimally. The toolkit implements both weak and strong efficiency tests.
Weak Efficiency: Mincer-Zarnowitz Test
The Mincer-Zarnowitz regression tests whether anticipated changes are fully incorporated by regressing outturns on forecasts:
\[y_t = \alpha + \beta\,\hat{y}_{t|t-h} + u_t\]
Under the null hypothesis of efficiency:
\(\alpha = 0\) (no constant term)
\(\beta = 1\) (unit coefficient)
Interpretation:
\(\alpha \neq 0\): Systematic bias in forecasts
\(\beta \neq 1\): Over- or under-reaction to available information
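A minimal sketch of the weak efficiency test follows, assuming the standard Mincer-Zarnowitz form in which outturns are regressed on a constant and the forecast. For brevity it uses a plain OLS covariance matrix in the joint Wald statistic; a HAC covariance would be the natural choice at multi-step horizons. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def mincer_zarnowitz(outturns, forecasts):
    """OLS estimates of alpha and beta, plus a Wald statistic for
    the joint null (alpha, beta) = (0, 1).

    Uses a homoskedastic OLS covariance (a simplification of this sketch).
    """
    y = np.asarray(outturns, dtype=float)
    f = np.asarray(forecasts, dtype=float)
    X = np.column_stack([np.ones(f.size), f])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (y.size - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    r = coef - np.array([0.0, 1.0])      # distance from the efficient values
    wald = r @ np.linalg.inv(cov) @ r    # asymptotically chi-squared with 2 d.o.f.
    return float(coef[0]), float(coef[1]), float(wald)

# Synthetic example in which forecasts over-react (true beta = 0.8)
rng = np.random.default_rng(1)
forecasts = rng.normal(2.0, 1.0, 100)
outturns = 0.5 + 0.8 * forecasts + rng.normal(0.0, 0.1, 100)
alpha_hat, beta_hat, wald_stat = mincer_zarnowitz(outturns, forecasts)
```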
Strong Efficiency: Blanchard-Leigh regressions
This approach examines whether cross-variable relationships are correctly calibrated in forecasts. It tests whether misforecasts in one variable predict misforecasts in related variables, indicating potential miscalibration of economic relationships.
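One way to illustrate the idea is to regress the forecast errors of one variable on the forecast errors of a related variable and inspect the slope. The specification below is an assumption for illustration (the exact regressors used by the toolkit are not stated here), as are the function name and the synthetic data.

```python
import numpy as np

def cross_error_regression(errors_y, errors_x):
    """Regress errors in variable y on a constant and errors in variable x.

    Returns the slope and its OLS t-statistic. A slope significantly
    different from zero suggests the relationship between the two
    variables may be miscalibrated in the forecast.
    """
    ey = np.asarray(errors_y, dtype=float)
    ex = np.asarray(errors_x, dtype=float)
    X = np.column_stack([np.ones(ex.size), ex])
    coef = np.linalg.lstsq(X, ey, rcond=None)[0]
    resid = ey - X @ coef
    sigma2 = resid @ resid / (ey.size - 2)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(coef[1]), float(coef[1] / np.sqrt(cov[1, 1]))

# Synthetic example: errors in y load on errors in x with slope 0.6
rng = np.random.default_rng(2)
errors_x = rng.normal(0.0, 1.0, 80)
errors_y = 0.6 * errors_x + rng.normal(0.0, 0.2, 80)
slope, slope_t = cross_error_regression(errors_y, errors_x)
```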
Benchmark Models#
The toolkit benchmarks forecasts against several model classes:
AR(p) Models
Univariate autoregressive models estimated on real-time data
Provides a mechanical statistical baseline
Bayesian VAR
Multivariate model capturing key economic relationships
COMPASS DSGE Model
Bank’s workhorse dynamic stochastic general equilibrium model
Random Walk
Model assuming no change from current value
Most naive baseline available
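The two simplest benchmarks can be sketched in a few lines. This is an illustration under assumptions: the AR(p) model is fitted by OLS with an intercept and forecasts are produced by iterating one step at a time, and the function names are hypothetical rather than taken from the toolkit.

```python
import numpy as np

def random_walk_forecast(history, horizon):
    """No-change forecast: repeat the last observed value."""
    return np.full(horizon, float(history[-1]))

def ar_forecast(history, p, horizon):
    """Fit AR(p) with intercept by OLS and iterate forecasts forward."""
    y = np.asarray(history, dtype=float)
    # Each row is [1, y[t-1], ..., y[t-p]] for t = p, ..., T-1
    lag_cols = [y[p - j:y.size - j] for j in range(1, p + 1)]
    X = np.column_stack([np.ones(y.size - p)] + lag_cols)
    coef = np.linalg.lstsq(X, y[p:], rcond=None)[0]

    path = list(y[-p:])
    out = []
    for _ in range(horizon):
        lags = path[::-1][:p]            # most recent value first
        nxt = coef[0] + np.dot(coef[1:], lags)
        out.append(nxt)
        path.append(nxt)                 # feed the forecast back in
    return np.array(out)

# Deterministic AR(1) series: y_t = 1 + 0.5 * y_{t-1}, starting from 0
series = [0.0]
for _ in range(7):
    series.append(1.0 + 0.5 * series[-1])
ar_path = ar_forecast(series, p=1, horizon=2)
rw_path = random_walk_forecast(series, horizon=2)
```

On this noiseless series the AR(1) fit recovers the generating coefficients, so its forecasts continue the series exactly, while the random walk simply holds the last value.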
Benchmarking Strategy#
Benchmarks are constructed on a real-time basis, using only information available when each forecast was made. This approach ensures fair comparison (no look-ahead bias).
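The real-time principle can be sketched as an expanding-window loop in which each benchmark forecast sees only the observations before its origin. The names `real_time_errors` and `naive` are hypothetical, and the sketch works with a single series, so it does not capture the data-vintage dimension of a genuinely real-time dataset.

```python
import numpy as np

def real_time_errors(series, forecaster, horizon, first_origin):
    """Errors of horizon-step forecasts made with expanding windows.

    forecaster(history, horizon) must return a path of length `horizon`
    covering periods origin, ..., origin + horizon - 1.
    """
    y = np.asarray(series, dtype=float)
    errors = []
    for origin in range(first_origin, y.size - horizon + 1):
        history = y[:origin]                      # nothing from the origin onwards
        path = forecaster(history, horizon)
        target = y[origin + horizon - 1]
        errors.append(target - path[horizon - 1]) # outturn minus forecast
    return np.array(errors)

def naive(history, horizon):
    """No-change benchmark used here as the example forecaster."""
    return np.full(horizon, history[-1])

errs = real_time_errors([1, 2, 3, 4, 5, 6], naive, horizon=2, first_origin=2)
```

Because the model is refitted or re-evaluated at every origin, any forecaster with the same call signature can be swapped in without risk of look-ahead bias from the loop itself.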
Practical Guidance for Users#
When to Use Different Metrics#
RMSE and Accuracy Metrics:
Use when absolute closeness to outcomes matters
Appropriate for general-purpose forecast assessment
Unbiasedness Tests:
Use when systematic over/under-prediction would be problematic
Important for policy-relevant variables where consistent bias could lead to persistent policy errors
Requires moderate sample sizes for power
Efficiency Tests:
Use when assessing whether forecasters are fully utilizing available information
Mincer-Zarnowitz for single-variable efficiency
Blanchard-Leigh for cross-variable relationship calibration
Benchmark Comparisons:
Always use when evaluating forecast value-added
Helps control for underlying economic volatility
Essential for context (is performance good or bad relative to realistic alternatives?)
Interpreting Uncertainty#
Several sources of uncertainty affect forecast evaluation:
Small Sample Sizes: Historical evaluation periods may be limited, reducing statistical power
Non-stationary Economic Relationships: Structural breaks or regime changes complicate inference
Data Revisions: Outturns change over time, affecting computed errors
Forecast Vintage Correlations: Multiple horizons from the same forecast are correlated
The toolkit mitigates these through HAC standard errors, Harvey corrections for small samples, sensitivity analysis over vintage choices, and fluctuation tests for detecting instabilities.
Technical References#
Key Literature
Diebold, F. and Mariano, R. (1995). Comparing Predictive Accuracy. Journal of Business & Economic Statistics, 13(3), 253–263.
Mincer, J. A. and Zarnowitz, V. (1969). The Evaluation of Economic Forecasts. In Economic Forecasts and Expectations, NBER, pp. 3–46.
Nordhaus, W. D. (1987). Forecasting Efficiency: Concepts and Applications. Review of Economics and Statistics, 69(4), 667–674.
Blanchard, O. J. and Leigh, D. (2013). Growth Forecast Errors and Fiscal Multipliers. American Economic Review, 103(3), 117–120.
Harvey, D., Leybourne, S. and Newbold, P. (1997). Testing the equality of prediction mean squared errors. International Journal of Forecasting, 13(2), 281–291.
Harvey, D. I., Leybourne, S. J. and Whitehouse, E. J. (2017). Forecast evaluation tests and negative long-run variance estimates in small samples. International Journal of Forecasting, 33(4), 833–847.
Bank of England Resources
Bank of England (2026). Forecast Evaluation Report: January 2026.
Kanngiesser, D. and Willems, T. (2024). Forecast accuracy and efficiency at the Bank of England. Bank of England Staff Working Paper.
Independent Evaluation Office (2015). Evaluating forecast performance. Bank of England.
Package Documentation
See the User Guide and API Reference for details on installation and usage.
GitHub Repository: bank-of-england/forecast_evaluation