Utilities

forecast_evaluation.utils.covid_filter(df: DataFrame) → DataFrame

Filter data to exclude COVID-affected periods.

For the ‘gdpkp’ variable, dates from 2020-01-01 to 2022-03-31 are excluded unless the forecast vintage is 2022-01-01 or later. For all other variables, all 2020 and 2021 dates are removed for vintages before 2020Q4.

Parameters:

df (pd.DataFrame) – DataFrame containing forecast data with ‘variable’, ‘date’, and ‘vintage_date_forecast’ columns

Returns:

Filtered DataFrame with COVID periods removed based on variable type

Return type:

pd.DataFrame
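
The rule above can be sketched in plain pandas. This is an illustrative re-implementation rather than the library's code: the function name `covid_filter_sketch` is hypothetical, and it assumes that "pre-2020Q4 vintages" means vintages dated before 2020-10-01 and that both date columns are already datetime typed.

```python
import pandas as pd


def covid_filter_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch of the documented COVID filter (not the library code)."""
    is_gdp = df["variable"] == "gdpkp"
    vintage = df["vintage_date_forecast"]

    # 'gdpkp': drop 2020-01-01..2022-03-31 unless the vintage is 2022-01-01 or later.
    gdp_drop = (
        is_gdp
        & df["date"].between(pd.Timestamp("2020-01-01"), pd.Timestamp("2022-03-31"))
        & (vintage < pd.Timestamp("2022-01-01"))
    )

    # Other variables: drop 2020 and 2021 dates for vintages before 2020Q4
    # (assumed here to mean before 2020-10-01).
    other_drop = (
        ~is_gdp
        & df["date"].dt.year.isin([2020, 2021])
        & (vintage < pd.Timestamp("2020-10-01"))
    )

    return df[~(gdp_drop | other_drop)]


df = pd.DataFrame(
    {
        "variable": ["gdpkp", "gdpkp", "cpisa", "cpisa"],
        "date": pd.to_datetime(
            ["2020-06-30", "2020-06-30", "2021-03-31", "2021-03-31"]
        ),
        "vintage_date_forecast": pd.to_datetime(
            ["2019-11-01", "2022-02-01", "2019-11-01", "2021-02-01"]
        ),
    }
)

# Rows 0 and 2 come from pre-COVID vintages and target COVID-era dates, so they are dropped.
filtered = covid_filter_sketch(df)
```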

forecast_evaluation.utils.ensure_consistent_date_range(df: DataFrame) → DataFrame

Filter data to ensure consistent vintage_date_forecast range across all sources for each variable.

This function addresses the problem where different forecast sources may have different availability periods for the same variable. It finds the overlapping time period where ALL sources have data for each variable, ensuring fair comparison across models.

The function uses the latest start date and earliest end date across all sources for each variable, creating a “common denominator” time period.

Parameters:

df (pandas.DataFrame) –

DataFrame containing forecast accuracy data with required columns:

  • ‘variable’ : str - Variable identifier (e.g., ‘gdpkp’, ‘cpisa’, ‘unemp’)

  • ‘source’ : str - Forecast source identifier (e.g., ‘compass conditional’, ‘mpr’)

  • ‘vintage_date_forecast’ : datetime - Forecast vintage date

Returns:

Filtered DataFrame containing only data within the consistent date range for each variable.

Return type:

pandas.DataFrame
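
The "latest start, earliest end" logic can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `consistent_range_sketch` is hypothetical, and the window bounds are treated as inclusive.

```python
import pandas as pd


def consistent_range_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: keep only vintages inside the window shared by all sources."""
    # First and last vintage available for each (variable, source) pair.
    per_source = df.groupby(["variable", "source"])["vintage_date_forecast"].agg(
        ["min", "max"]
    )
    # Common window per variable: latest start date and earliest end date.
    start = per_source["min"].groupby(level="variable").max()
    end = per_source["max"].groupby(level="variable").min()

    # Look up each row's window by its variable and keep rows inside it (inclusive).
    keep = df["vintage_date_forecast"].between(
        df["variable"].map(start), df["variable"].map(end)
    )
    return df[keep]


df = pd.DataFrame(
    {
        "variable": ["gdpkp"] * 6,
        "source": ["mpr"] * 3 + ["compass conditional"] * 3,
        "vintage_date_forecast": pd.to_datetime(
            [
                "2019-02-01", "2019-05-01", "2019-08-01",  # mpr
                "2019-05-01", "2019-08-01", "2019-11-01",  # compass conditional
            ]
        ),
    }
)

# Only the 2019-05 and 2019-08 vintages are available from both sources.
common = consistent_range_sketch(df)
```

If the sources for a variable do not overlap at all, the start of this window falls after its end and the sketch returns no rows for that variable.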

forecast_evaluation.utils.filter_k(df: DataFrame, k: int = 12, fill_k: bool = True) → DataFrame

Filter the dataset for a particular k, replacing unreleased outturn vintages with the latest vintage.

Parameters:
  • df (pd.DataFrame) – DataFrame containing forecast data with ‘k’, ‘latest_vintage’, and ‘vintage_date_outturn’ columns

  • k (int, default=12) – Number of revisions to filter by

  • fill_k (bool, default=True) – If True, substitutes unreleased outturns from the latest vintage

Returns:

Filtered DataFrame containing only rows where k matches or latest vintage is used

Return type:

pd.DataFrame

forecast_evaluation.utils.filter_sources(df: DataFrame, sources: list[str]) → DataFrame

Filter the dataset based on source identifiers.

Parameters:
  • df (pd.DataFrame) – DataFrame containing forecast data with ‘unique_id’ column

  • sources (list[str]) – List of source identifiers to filter by

Returns:

Filtered DataFrame containing only rows matching the specified sources

Return type:

pd.DataFrame

forecast_evaluation.utils.find_ids_to_exclude(df: DataFrame, sources: list[str]) → list[str]

Find the identifiers to exclude for a given source selection. Filtering works by exclusion, which should be the most efficient approach. For example, suppose id1 = [“A”, “B”] and id2 = [“big”, “small”]. If a user selects sources = [“A”], every unique_id containing “B” is excluded. If a user selects sources = [“C”], every unique_id containing “A” or “B” is excluded. The identifiers to exclude are therefore found first.

Parameters:
  • df (pd.DataFrame) – DataFrame containing a ‘unique_id’ column of concatenated identifier strings with parts separated by ‘+’

  • sources (list[str]) – List of source identifiers to check against

Returns:

List of identifier parts to exclude

Return type:

list[str]
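
The exclusion example from the description can be reproduced with a small pure-Python sketch. It is not the library's code: the function name is hypothetical, it takes a plain list of unique_id strings rather than a DataFrame, and it assumes the source identifier is the first ‘+’-separated part of each unique_id, which the documentation does not confirm.

```python
def find_ids_to_exclude_sketch(unique_ids: list[str], sources: list[str]) -> list[str]:
    """Hypothetical sketch: identifier parts that the selected sources rule out."""
    # Assumption: the first '+'-separated part of each unique_id is the source identifier.
    first_parts = {uid.split("+")[0] for uid in unique_ids}
    # Anything in that position that the user did not select must be excluded.
    return sorted(first_parts - set(sources))


# id1 = ["A", "B"] and id2 = ["big", "small"], combined with '+'.
unique_ids = ["A+big", "A+small", "B+big", "B+small"]

exclude_a = find_ids_to_exclude_sketch(unique_ids, ["A"])
exclude_c = find_ids_to_exclude_sketch(unique_ids, ["C"])

# Applying the exclusion list keeps only the unique_ids for the selected source.
kept = [uid for uid in unique_ids if uid.split("+")[0] not in exclude_a]
```

Selecting “A” excludes “B”, while selecting the unknown source “C” excludes both “A” and “B”, which matches the behaviour described above.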

forecast_evaluation.utils.flatten_col_name(obj: DataFrame | Series) → DataFrame | Series

Convert tuple column names to strings.

Parameters:

obj (pd.DataFrame or pd.Series) – DataFrame or Series with potentially multi-level column names

Returns:

DataFrame or Series with flattened column names

Return type:

pd.DataFrame or pd.Series
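
As an illustration of what flattening tuple column names involves, the sketch below joins each level of a MultiIndex column with an underscore. The separator and the variable names are assumptions made for this example; the documentation does not state how the library joins the levels.

```python
import pandas as pd

# A DataFrame with two-level (tuple) column names, as produced by some aggregations.
df = pd.DataFrame(
    [[1.0, 2.0]],
    columns=pd.MultiIndex.from_tuples([("error", "mean"), ("error", "std")]),
)

flat = df.copy()
# Join tuple levels into one string; the "_" separator is an assumption for this sketch.
flat.columns = [
    "_".join(map(str, col)) if isinstance(col, tuple) else str(col)
    for col in flat.columns
]
```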

forecast_evaluation.utils.reconstruct_id_cols_from_unique_id(df: DataFrame, id_columns: list[str]) → DataFrame

Reconstruct individual identifier columns from the ‘unique_id’ column.

Parameters:
  • df (pd.DataFrame) – DataFrame containing a ‘unique_id’ column with concatenated identifiers.

  • id_columns (list of str) – List of column names to assign to the reconstructed identifier parts.

Returns:

DataFrame with reconstructed identifier columns.

Return type:

pd.DataFrame
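
A minimal sketch of the reconstruction, assuming the parts of ‘unique_id’ are separated by ‘+’ and appear in the same order as id_columns. The column names id1 and id2 are illustrative, and this is not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({"unique_id": ["A+big", "B+small"]})
id_columns = ["id1", "id2"]  # illustrative names for the reconstructed parts

# Split each unique_id on '+' into one column per identifier part.
parts = df["unique_id"].str.split("+", expand=True)
parts.columns = id_columns

# Attach the reconstructed identifier columns alongside the original data.
reconstructed = df.join(parts)
```

This sketch assumes every unique_id has exactly as many parts as there are names in id_columns; otherwise the column assignment raises an error.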