Tests#
- class forecast_evaluation.tests.TestResult(df: DataFrame, id_columns: list[str] = None, metadata: dict | None = None)[source]#
Bases: object
Universal result object for all forecast evaluation tests.
Provides common functionality for storing test results, metadata, and methods for data manipulation, visualization, and export.
- _df#
The underlying DataFrame containing test results
- Type:
pd.DataFrame
- _metadata#
Metadata about the test including parameters, filters, and provenance
- Type:
dict
- __init__(df: DataFrame, id_columns: list[str] = None, metadata: dict | None = None)[source]#
Initialize a TestResult object.
- Parameters:
df (pd.DataFrame) – DataFrame containing test results
id_columns (list of str, optional) – List of identifier columns to reconstruct from unique_id
metadata (dict, optional) –
Dictionary containing test metadata including:
test_name : str - Name of the test function
parameters : dict - Test parameters (k, same_date_range, etc.)
filters : dict - Applied filters (source, variable, etc.)
date_range : tuple - (start_date, end_date) of the data
- describe() DataFrame[source]#
Generate descriptive statistics of test results.
- Returns:
Descriptive statistics (count, mean, std, min, max, etc.)
- Return type:
pd.DataFrame
- filter(variable: str | list[str] | None = None, source: str | list[str] | None = None, horizon: int | list[int] | None = None, **kwargs) TestResult[source]#
Filter results by specified criteria.
- Parameters:
variable (str or list of str, optional) – Variable(s) to include
source (str or list of str, optional) – Source(s) to include
horizon (int or list of int, optional) – Forecast horizon(s) to include
**kwargs – Additional column-based filters
- Returns:
New TestResult object with filtered data
- Return type:
TestResult
- plot(**kwargs)[source]#
Generate appropriate visualization for this result type.
Automatically detects the test type from metadata and routes to the appropriate visualization function.
- Parameters:
**kwargs – Visualization-specific parameters (vary by test type)
- Returns:
(fig, ax) matplotlib figure and axes objects, or None
- Return type:
tuple or None
- Raises:
ValueError – If test type cannot be determined or no visualization is available
- summary() str[source]#
Generate a formatted statistical summary of the test results.
- Returns:
Formatted summary of key findings
- Return type:
str
- to_csv(path: str | None = None, **kwargs) str | None[source]#
Export results to CSV file or return as string.
- Parameters:
path (str, optional) – Output file path. If None, returns CSV as string.
**kwargs – Additional arguments passed to pd.DataFrame.to_csv()
- Returns:
CSV string if path is None, otherwise None
- Return type:
str or None
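The methods above compose naturally: filter a result, inspect it, then export. A minimal sketch of the wrapped-DataFrame pattern using plain pandas (the toy DataFrame and its values are hypothetical, not from the package):

```python
import pandas as pd

# Toy stand-in for the DataFrame a TestResult wraps (hypothetical values)
df = pd.DataFrame({
    "source": ["mpr", "mpr", "obr"],
    "forecast_horizon": [0, 4, 0],
    "rmse": [0.5, 1.2, 0.6],
})
# filter(source="mpr") returns a new result holding only the matching rows
filtered = df[df["source"] == "mpr"]
# to_csv(path=None) returns the CSV text instead of writing a file
csv_text = filtered.to_csv(index=False)
print(csv_text)
```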
- forecast_evaluation.tests.bias_analysis(data: ForecastData, source: None | str | list[str] = None, variable: None | str | list[str] = None, k: int = 12, same_date_range: bool = True, verbose: bool = False) TestResult[source]#
Run bias tests for all unique combinations of variable, source, metric, and forecast_horizon.
This function performs systematic bias testing across all available combinations in the dataset using the evaluate_bias function. It runs regression-based bias tests and aggregates results into a comprehensive summary.
- Parameters:
data (ForecastData) – An instance of the ForecastData class containing ForecastData._main_table.
source (None, str, or list of str, default=None) – Filter for specific forecast source(s). If None, includes all sources. Can be a single source name or a list of source names.
variable (None, str, or list of str, default=None) – Filter for specific variable(s). If None, includes all variables. Can be a single variable name or a list of variable names.
k (int, default=12) – Number of revisions used to define the outturns.
same_date_range (bool, default=True) – If True, ensures consistent date ranges across sources when multiple sources are analysed. If False, uses all available data for each source independently.
verbose (bool, default=False) – If True, prints detailed results for each individual bias test. If False, only prints summary statistics at the end.
- Returns:
TestResult object containing the summary DataFrame with bias test results and metadata. The underlying DataFrame contains columns:
’source’ : str - Forecast source identifier
’variable’ : str - Variable identifier
’metric’ : str - Metric identifier
’frequency’ : str - Data frequency identifier
’forecast_horizon’ : int - Forecast horizon identifier
’bias_estimate’ : float - Estimated bias coefficient (constant term from regression)
’std_error’ : float - HAC-corrected standard error of bias estimate
’t_statistic’ : float - t-statistic for bias test
’p_value’ : float - Two-tailed p-value for bias test
’bias_conclusion’ : str - ‘Biased’ if p < 0.05, ‘Unbiased’ if p >= 0.05
’n_observations’ : int - Number of observations used in the test
’ci_lower’ : float - Lower bound of 95% confidence interval for bias estimate
’ci_upper’ : float - Upper bound of 95% confidence interval for bias estimate
- Return type:
TestResult
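Conceptually, each per-combination bias test is a regression of forecast errors on a constant, i.e. a test that the mean error is zero. An illustrative numpy sketch with made-up errors (the package uses HAC-corrected standard errors; the plain standard error below is a simplification):

```python
import numpy as np

# Hypothetical forecast errors (actual - forecast) for one combination
errors = np.array([0.5, -0.2, 0.8, 0.1, -0.4, 0.6, 0.3, -0.1])

bias_estimate = errors.mean()  # constant term from a constant-only regression
std_error = errors.std(ddof=1) / np.sqrt(len(errors))  # plain SE, not HAC
t_statistic = bias_estimate / std_error
# |t| < 1.96 here, so this toy series would be labelled 'Unbiased' at 5%
print(round(bias_estimate, 3))  # → 0.2
```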
- forecast_evaluation.tests.blanchard_leigh_horizon_analysis(data: ForecastData, source: str, outcome_variable: str, outcome_metric: Literal['levels', 'pop', 'yoy'], instrument_variable: str, instrument_metric: Literal['levels', 'pop', 'yoy'], horizons: ndarray = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), j: int = 2, frequency: Literal['Q', 'M'] = 'Q', k: int = 12, alpha: float = 0.05) TestResult[source]#
Run Blanchard-Leigh efficiency tests across multiple forecast horizons.
- Parameters:
data (ForecastData) – ForecastData object containing forecast and outturn data
source (str) – Source of the forecasts (e.g., ‘MPR’, ‘OBR’)
outcome_variable (str) – Variable for forecast error analysis
outcome_metric (Literal["levels", "pop", "yoy"]) – Metric type for the outcome variable
instrument_variable (str) – Variable used as instrument
instrument_metric (Literal["levels", "pop", "yoy"]) – Metric type for the instrument variable
horizons (np.ndarray, default=np.arange(0, 13)) – Array of forecast horizons to test
j (int, default=2) – Forecast horizon of instrument variable
frequency (Literal["Q", "M"], default='Q') – Frequency of the data (quarterly or monthly)
k (int, default=12) – Number of revisions used to define the outturn
alpha (float, default=0.05) – Significance level for confidence intervals
- Returns:
TestResult object containing a DataFrame with results for each horizon and metadata about the test parameters.
- Return type:
TestResult
- forecast_evaluation.tests.compare_to_benchmark(df: DataFrame, benchmark_model: str, statistic: Literal['rmse', 'rmedse', 'mean_abs_error'] = 'rmse') DataFrame[source]#
Compare each model’s accuracy statistic to a benchmark model’s statistic.
- Parameters:
df (pandas.DataFrame) –
DataFrame containing accuracy statistics with columns:
’variable’ : str - Variable identifier
’source’ : str - Forecast source identifier
’metric’ : str - Metric identifier
’forecast_horizon’ : int - Forecast horizon identifier
’rmse’ : float - Root Mean Square Error
’rmedse’ : float - Root Median Square Error
’mean_abs_error’ : float - Mean Absolute Error
’n_observations’ : int - Number of observations
benchmark_model (str) – The model to use as the benchmark for comparison (e.g., ‘mpr’).
statistic (str, optional) – The accuracy statistic to compare. Must be one of ‘rmse’, ‘rmedse’, or ‘mean_abs_error’. Default is ‘rmse’.
- Returns:
DataFrame with an additional column:
’{statistic}_to_benchmark’ : float - Ratio of model’s statistic to benchmark model’s statistic
- Return type:
pandas.DataFrame
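The comparison reduces to dividing each row’s statistic by the benchmark model’s statistic for the same horizon. A pandas sketch with hypothetical values (‘ar1’ is a made-up comparator model):

```python
import pandas as pd

df = pd.DataFrame({
    "source": ["mpr", "mpr", "ar1", "ar1"],
    "forecast_horizon": [0, 4, 0, 4],
    "rmse": [0.50, 1.00, 0.60, 0.90],
})
# RMSE of the benchmark model at each horizon
benchmark = df.loc[df["source"] == "mpr"].set_index("forecast_horizon")["rmse"]
# Ratio > 1 means the model is less accurate than the benchmark at that horizon
df["rmse_to_benchmark"] = df["rmse"] / df["forecast_horizon"].map(benchmark)
print(df["rmse_to_benchmark"].tolist())
```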
- forecast_evaluation.tests.compute_accuracy_statistics(data: ForecastData, source: None | str | list[str] = None, variable: None | str | list[str] = None, k: int = 12, same_date_range: bool = True) TestResult[source]#
Calculate accuracy statistics for all unique combinations of variable, source, metric and forecast_horizon.
This function computes the Root Mean Square Error (RMSE), Mean Absolute Error (MAE), Root Median Square Error (RMedSE) and the number of observations for each combination.
- Parameters:
data (ForecastData) – ForecastData object containing the main table with forecast accuracy data.
source (None, str, or list of str, default=None) – Filter for specific forecast source(s). If None, includes all sources. Can be a single source name or a list of source names.
variable (None, str, or list of str, default=None) – Filter for specific variable(s). If None, includes all variables. Can be a single variable name or a list of variable names.
k (int, optional, default=12) – Number of revisions used to define the outturns.
same_date_range (bool, optional, default=True) – If True, ensures consistent date ranges across sources when multiple sources are analysed. If False, uses all available data for each source independently.
- Returns:
TestResult object containing the summary DataFrame with accuracy statistics and metadata. The underlying DataFrame contains columns:
’variable’ : str - Variable identifier
’source’ : str - Forecast source identifier
’metric’ : str - Metric identifier
’frequency’ : str - Frequency identifier
’forecast_horizon’ : int - Forecast horizon identifier
’rmse’ : float - Root Mean Square Error
’rmedse’ : float - Root Median Square Error
’mean_abs_error’ : float - Mean Absolute Error
’n_observations’ : int - Number of observations used in the calculation
’start_date’ : datetime - Earliest forecast vintage date in the group
’end_date’ : datetime - Latest forecast vintage date in the group
- Return type:
TestResult
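Within each group, the three statistics are simple transforms of the squared and absolute errors. A numpy sketch with made-up errors:

```python
import numpy as np

errors = np.array([0.5, -1.0, 0.25, -0.25])  # hypothetical forecast errors

rmse = np.sqrt(np.mean(errors ** 2))       # Root Mean Square Error
rmedse = np.sqrt(np.median(errors ** 2))   # Root Median Square Error
mean_abs_error = np.mean(np.abs(errors))   # Mean Absolute Error
print(mean_abs_error)  # → 0.5
```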
- forecast_evaluation.tests.create_comparison_table(df: DataFrame, variable: str, metric: Literal['levels', 'pop', 'yoy'], frequency: Literal['Q', 'M'], benchmark_model: str, statistic: Literal['rmse', 'rmedse', 'mse', 'mean_abs_error'] = 'rmse', horizons: list[int] = [0, 1, 2, 4, 8, 12]) DataFrame[source]#
Create a comparison table showing the ratio of each model’s accuracy statistic to a benchmark model’s statistic across selected forecast horizons.
This function filters the data for a specific variable, metric and frequency combination, then creates a pivot table with forecast sources as rows and forecast horizons as columns. The values represent the ratio of each model’s accuracy statistic to the benchmark model.
- Parameters:
df (pandas.DataFrame) –
DataFrame containing accuracy statistics with columns:
’variable’ : str - Variable identifier (e.g., ‘gdpkp’, ‘cpisa’, ‘unemp’)
’source’ : str - Forecast source identifier (e.g., ‘compass conditional’, ‘mpr’)
’metric’ : str - Metric identifier (e.g., ‘yoy’, ‘levels’)
’forecast_horizon’ : int - Forecast horizon identifier
’rmse’ : float - Root Mean Square Error
’rmedse’ : float - Root Median Square Error
’mean_abs_error’ : float - Mean Absolute Error
’n_observations’ : int - Number of observations
variable (str) – Variable to analyse (e.g., ‘aweagg’, ‘cpisa’, ‘gdpkp’, ‘unemp’).
metric (str) – Metric to analyse (e.g., ‘yoy’, ‘levels’).
frequency (Literal["Q", "M"]) – Frequency to analyse, either quarterly (“Q”) or monthly (“M”).
benchmark_model (str) – The forecast source to use as the benchmark for comparison (e.g., ‘mpr’).
statistic (str, optional) – The accuracy statistic to compare. Must be one of ‘rmse’, ‘rmedse’, ‘mse’, or ‘mean_abs_error’. Default is ‘rmse’.
horizons (list of int, optional) – List of forecast horizons to include in the table. Default is [0, 1, 2, 4, 8, 12].
- Returns:
Pivot table with MultiIndex columns where:
Index: forecast sources (excluding baseline models and benchmark model)
Columns: MultiIndex with ‘Forecast horizon’ as top level and forecast horizons as second level
Values: ratio of model’s accuracy statistic to benchmark model’s statistic
- Return type:
pandas.DataFrame
- forecast_evaluation.tests.diebold_mariano_table(data, benchmark_model: str, k: int = 12, loss_function: Literal['mse', 'mae'] = 'mse', horizons: list[int] = None) TestResult[source]#
Run Diebold-Mariano tests comparing all models to a benchmark across all series.
This function performs DM tests for every combination of variable, metric, frequency, and forecast horizon, comparing each model’s forecast errors to the benchmark model.
- Parameters:
data (ForecastData) – ForecastData object containing the main table with forecast accuracy data.
benchmark_model (str) – The forecast source to use as the benchmark (e.g., ‘mpr’)
k (int, optional) – Number of revisions used to define the outturns. Default is 12.
loss_function (Literal["mse", "mae"], optional) – Loss function to use for comparison. Default is “mse”.
horizons (list of int, optional) – List of forecast horizons to test. Default is all horizons in the data.
- Returns:
TestResult object containing the summary DataFrame with test results and metadata. The underlying DataFrame contains columns:
’variable’: Variable identifier
’metric’: Metric identifier
’frequency’: Frequency identifier
’source’: Model being compared to benchmark
’forecast_horizon’: Forecast horizon
’dm_statistic’: DM test statistic
’p_value’: P-value from DM test
’n_observations’: Number of observations used
’rmse_ratio’: Ratio of model RMSE to benchmark RMSE
’benchmark_source’: Benchmark model name
- Return type:
TestResult
- forecast_evaluation.tests.diebold_mariano_test(error_difference: Series, horizon: int) dict[source]#
Perform the Diebold-Mariano test to compare forecast accuracy between two models.
The Diebold-Mariano test assesses whether the difference in forecast accuracy between two models is statistically significant. It accounts for autocorrelation in forecast errors using Newey-West HAC standard errors.
We apply the small-sample correction of Harvey, Leybourne, and Newbold (1997). Following Harvey, Leybourne, and Whitehouse (2017), we use only the Bartlett kernel when the variance estimate is negative; the standard variance estimator has better small-sample properties.
- Parameters:
error_difference (pandas.Series) – Difference in losses between the model being evaluated and the benchmark. With error = (actual - forecast), a squared-error loss gives error_difference = (errors_model)**2 - (errors_benchmark)**2, but the loss function does not have to be the square.
horizon (int) – The forecast horizon (h-step ahead). Used to determine the number of lags for HAC standard errors.
- Returns:
Dictionary containing test results:
’dm_statistic’: float - The Diebold-Mariano test statistic
’p_value’: float - Two-tailed p-value (tests if losses are significantly different)
’mean_loss_diff’: float - Mean difference in losses (model - benchmark)
’interpretation’: str - Interpretation of the results
- Return type:
dict
Notes
Null hypothesis: The two models have equal predictive accuracy
Alternative hypothesis: The models have different predictive accuracy
A negative DM statistic indicates the model is more accurate than the benchmark
Uses Newey-West HAC standard errors with lags = horizon (because horizons start at 0)
References
Diebold and Mariano (1995): https://doi.org/10.2307/1392185
Harvey, Leybourne, and Newbold (1997): https://doi.org/10.1016/S0169-2070(96)00719-4
Harvey, Leybourne, and Whitehouse (2017): https://doi.org/10.1016/j.ijforecast.2017.05.001
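Stripped of the small-sample corrections described above, the DM statistic is the mean loss differential divided by its Newey-West standard error. A self-contained numpy sketch (the HLN 1997 correction is deliberately omitted here):

```python
import numpy as np

def dm_statistic(loss_diff, horizon):
    """Simplified DM statistic: mean loss differential scaled by a
    Newey-West (Bartlett kernel) long-run variance with lags = horizon."""
    d = np.asarray(loss_diff, dtype=float)
    n = len(d)
    dc = d - d.mean()
    lrv = dc @ dc / n  # lag-0 autocovariance
    for lag in range(1, horizon + 1):
        w = 1.0 - lag / (horizon + 1)  # Bartlett weight
        lrv += 2.0 * w * (dc[lag:] @ dc[:-lag]) / n
    return d.mean() / np.sqrt(lrv / n)

# Mostly negative loss differences: the model beats the benchmark
d = np.array([-0.2, -0.1, -0.3, 0.05, -0.15, -0.25, -0.1, -0.2])
print(dm_statistic(d, horizon=2) < 0)  # → True
```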
- forecast_evaluation.tests.fluctuation_tests(data: ForecastData, window_size: int, test_func: callable, test_args: dict = {}, start_vintage: str | None = None, end_vintage: str | None = None)[source]#
Perform fluctuation tests. In practice, a fluctuation test runs a test on a rolling window with adjusted critical values; the fluctuation null hypothesis is that the original test statistic is not rejected in any window.
- Parameters:
data (ForecastData) – ForecastData object containing the main table
window_size (int (>0)) – Number of vintages to include in each window
test_func (callable) – Test function to run on each window. Must be one of: ‘diebold_mariano_table’, ‘bias_analysis’, or ‘weak_efficiency_analysis’
test_args (dict, default={}) – Additional keyword arguments to pass to the test function. Should NOT include the ‘data’ parameter as it will be added automatically.
start_vintage (str, optional) – Start vintage date (format ‘YYYY-MM-DD’). If None, uses the earliest vintage.
end_vintage (str, optional) – End vintage date (format ‘YYYY-MM-DD’). If None, uses the latest vintage.
- Returns:
TestResult object with rolling test results, including:
’window_start’ : Start vintage of each window
’window_end’ : End vintage of each window
’test_statistic’ : Test statistic for each window
’critical_value_05’ : Critical value at 5% significance level
’critical_value_10’ : Critical value at 10% significance level
’reject_05’ : Boolean indicating rejection at 5% level
’reject_10’ : Boolean indicating rejection at 10% level
’max_test_statistic’ : Maximum test statistic across all windows for each group
’reject_max_05’ : Boolean indicating if max statistic rejects at 5% level
’reject_max_10’ : Boolean indicating if max statistic rejects at 10% level
- Return type:
TestResult
- forecast_evaluation.tests.revision_predictability_analysis(data, variable: str | list[str] = None, source: None | str | list[str] = None, frequency: Literal['Q', 'M'] = 'Q', n_revisions: Annotated[int, Gt(gt=0)] = 5, same_date_range: bool = True) TestResult[source]#
Run the revision test for all unique combinations of variables in the dataset.
This function systematically applies the revision test to every unique combination of variable, source, metric, and frequency in the provided dataset.
- Parameters:
data (ForecastData) – An instance of the ForecastData class containing ForecastData._forecasts.
variable (str or list of str, optional) – Single variable name or list of variable names to analyse.
source (None, str, or list of str, default=None) – Single source or list of forecast sources to include.
frequency (Literal["Q", "M"], default='Q') – Frequency of the data, either quarterly (“Q”) or monthly (“M”).
n_revisions (int (>0), default=5) – Maximum number of forecast horizons/revisions to include in each test.
same_date_range (bool, default=True) – If True, ensures consistent date ranges across sources when multiple sources are analysed.
- Returns:
TestResult object containing the summary DataFrame with test results and metadata. The underlying DataFrame contains columns:
’variable’ : str - Variable identifier
’source’ : str - Forecast source identifier
’metric’ : str - Metric identifier
’frequency’ : str - Data frequency identifier
’joint_test_fstat’ : float - F-statistic for joint test
’joint_test_pvalue’ : float - p-value for joint test
’reject_null’ : bool - Whether the null is rejected at the 5% level
’n_observations’ : int - Number of observations in the regression
Returns None if data is not available. Failed tests are excluded from the results.
- Return type:
TestResult
- forecast_evaluation.tests.revisions_errors_correlation_analysis(data: ForecastData, source: None | str | list[str] = None, variable: None | str | list[str] = None, k: int = 12, same_date_range: bool = True) TestResult[source]#
Run regressions of forecast revisions against forecast errors for all unique combinations of variable, source, metric, and forecast_horizon.
This function systematically tests forecast efficiency across all available forecast series by running the revisions-errors regression for each unique combination of forecasting parameters.
- Parameters:
data (ForecastData) – Class containing the main table.
source (None, str, or list of str, default=None) – Filter for specific forecast source(s). If None, includes all sources. Can be a single source name or a list of source names.
variable (None, str, or list of str, default=None) – Filter for specific variable(s). If None, includes all variables. Can be a single variable name or a list of variable names.
k (int, optional, default=12) – Number of revisions used to define the outturns.
same_date_range (bool, default=True) – If True, ensures consistent date ranges across sources when multiple sources are analysed. If False, uses all available data for each source independently.
- Returns:
TestResult object containing the summary DataFrame with test results and metadata. The underlying DataFrame contains:
source: str - forecast source identifier
variable: str - economic variable name
metric: str - measurement type
forecast_horizon: int - forecast horizon
const: float - intercept coefficient (α)
const_se: float - standard error of intercept
beta: float - slope coefficient (β)
beta_se: float - standard error of slope
const_pvalue: float - p-value for intercept test
beta_pvalue: float - p-value for slope test
correlated: bool - True if β is significant at 5% level
rsquared: float - coefficient of determination
n_observations: int - number of observations in regression
- Return type:
TestResult
- forecast_evaluation.tests.rolling_analysis(data: ForecastData, window_size: int, analysis_func: callable, analysis_args: dict, start_vintage: str | None = None, end_vintage: str | None = None)[source]#
Perform rolling window analysis using any analysis function.
- Parameters:
data (ForecastData) – ForecastData object containing the main table
window_size (int (>0)) – Number of periods to include in each window
analysis_func (callable) – Analysis function to run on each window (e.g., compute_accuracy_statistics, blanchard_leigh_efficiency_test)
analysis_args (dict) – Additional keyword arguments to pass to the analysis function. Should NOT include the ‘ForecastData’ instance as it will be added automatically.
start_vintage (str, optional) – Start vintage date (format ‘YYYY-MM-DD’)
end_vintage (str, optional) – End vintage date (format ‘YYYY-MM-DD’)
- Returns:
TestResult object containing the concatenated results from all windows. The DataFrame includes columns from the analysis function plus:
’window_start’: Start vintage of the window
’window_end’: End vintage of the window
- Return type:
TestResult
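The rolling mechanism itself is simple: slide a fixed-size window over the ordered vintages, run the analysis on each slice, and tag each result with the window bounds. A sketch with hypothetical vintage labels (the actual analysis call is left as a comment):

```python
import pandas as pd

vintages = [str(p) for p in pd.period_range("2019Q1", "2020Q4", freq="Q")]
window_size = 4
rows = []
for i in range(len(vintages) - window_size + 1):
    window = vintages[i:i + window_size]
    # here rolling_analysis would call analysis_func on the data restricted
    # to these vintages and keep that function's output columns
    rows.append({"window_start": window[0], "window_end": window[-1]})
out = pd.DataFrame(rows)
print(len(out))  # → 5
```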
- forecast_evaluation.tests.strong_efficiency_analysis(data: ForecastData, source: str, outcome_variable: str, outcome_metric: Literal['levels', 'pop', 'yoy'], instrument_variable: str, instrument_metric: Literal['levels', 'pop', 'yoy'], horizons: ndarray = array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]), j: int = 2, frequency: Literal['Q', 'M'] = 'Q', k: int = 12, alpha: float = 0.05) TestResult[source]#
Run strong efficiency tests across multiple forecast horizons.
This function performs strong efficiency tests by regressing forecast errors on instrument variables across multiple forecast horizons. It helps assess whether forecasts efficiently incorporate available information.
- Parameters:
data (ForecastData) – ForecastData object containing forecast and outturn data.
source (str) – Source of the forecasts (e.g., ‘MPR’, ‘OBR’).
outcome_variable (str) – Name of the outcome variable for which forecast errors are analysed.
outcome_metric (Literal["levels", "pop", "yoy"]) – Metric type for the outcome variable: - “levels”: Raw values - “pop”: Period-on-period percentage change - “yoy”: Year-on-year percentage change
instrument_variable (str) – Name of the instrument variable used in the regression.
instrument_metric (Literal["levels", "pop", "yoy"]) – Metric type for the instrument variable.
horizons (np.ndarray, optional) – Array of forecast horizons to test, by default np.arange(0, 13).
j (int, optional) – Forecast horizon of the instrument variable, by default 2.
frequency (Literal["Q", "M"], optional) – Frequency of the data, either quarterly (“Q”) or monthly (“M”), by default “Q”.
k (int, optional) – Number of revisions used to define the outturns, by default 12.
alpha (float, optional) – Significance level for confidence intervals, by default 0.05.
- Returns:
TestResult object containing a DataFrame with results for each horizon and metadata about the test parameters.
- Return type:
TestResult
- forecast_evaluation.tests.weak_efficiency_analysis(data: ForecastData, source: None | str | list[str] = None, variable: None | str | list[str] = None, k: int = 12, same_date_range: bool = True, verbose: bool = False) TestResult[source]#
Run weak efficiency tests for all unique combinations in the dataset.
This function systematically performs weak efficiency testing across all available combinations of variable, source, metric, and forecast_horizon in the dataset using the weak_efficiency_test function. It provides a comprehensive analysis of forecast efficiency across different variables, sources, and horizons.
- Parameters:
data (ForecastData) – Class containing the main table.
source (None, str, or list of str, default=None) – Filter for specific forecast source(s). If None, includes all sources. Can be a single source name or a list of source names.
variable (None, str, or list of str, default=None) – Filter for specific variable(s). If None, includes all variables. Can be a single variable name or a list of variable names.
k (int, default=12) – Number of revisions used to define the outturns.
same_date_range (bool, default=True) – If True, ensures consistent date ranges across sources when multiple sources are analysed. If False, uses all available data for each source independently.
verbose (bool, default=False) – If True, prints detailed results for each individual weak efficiency test including coefficient estimates, test statistics, and conclusions. If False, only prints summary progress information.
- Returns:
TestResult object containing a DataFrame with results for each combination and metadata about the test parameters. The underlying DataFrame contains columns:
’source’ : str - Forecast source identifier
’variable’ : str - Variable identifier
’metric’ : str - Metric identifier
’frequency’ : str - Data frequency identifier
’forecast_horizon’ : int - Forecast horizon identifier
’forecast_coef’ : float - Coefficient on value_outturn from Mincer-Zarnowitz regression
’constant_coef’ : float - Constant term coefficient from Mincer-Zarnowitz regression
’forecast_se’ : float - HAC-corrected standard error of forecast coefficient
’constant_se’ : float - HAC-corrected standard error of constant coefficient
’joint_test_fstat’ : float - F-statistic for joint hypothesis test of weak efficiency
’joint_test_pvalue’ : float - P-value for joint hypothesis test
’reject_weak_efficiency’ : bool - True if weak efficiency is rejected at 5% significance level
’n_observations’ : int - Number of observations used in each test
’ols_model’ : object - Full OLS model results (statsmodels RegressionResults object)
- Return type:
TestResult
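The underlying Mincer-Zarnowitz regression fits outturn = α + β·forecast and jointly tests α = 0, β = 1. A numpy sketch on toy data that is efficient by construction (the package additionally uses HAC standard errors for the joint F-test, which this sketch omits):

```python
import numpy as np

# Hypothetical forecast/outturn pairs where the forecast is efficient by
# construction, so the regression should recover alpha = 0, beta = 1
forecast = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
outturn = forecast.copy()

# OLS of outturn on a constant and the forecast
X = np.column_stack([np.ones_like(forecast), forecast])
constant_coef, forecast_coef = np.linalg.lstsq(X, outturn, rcond=None)[0]
# under weak efficiency: constant_coef ≈ 0 and forecast_coef ≈ 1
```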