Utilities

forecast_evaluation.utils.clean_unique_id(obj: DataFrame | str) → DataFrame | str

Strip trailing ' + ' separators from unique_id values.

When forecasts have unequal numbers of id components the concatenated unique_id column may contain trailing " + " fragments (e.g. "mpr2 + + "). This helper removes them so that display labels are clean.

Parameters: obj (pd.DataFrame or str) – If a DataFrame, a copy is returned with the unique_id column cleaned. If a string, the cleaned string is returned.
Returns: Cleaned object (the original DataFrame is never modified in place).
Return type: pd.DataFrame or str
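The behaviour described above can be approximated in a few lines. This is a minimal sketch, not the library's implementation: the name `strip_trailing_separators` and the regular expression are assumptions made for illustration.

```python
import re

import pandas as pd

# Assumed pattern: one or more trailing "+" fragments, each with optional whitespace.
TRAILING_SEPARATORS = r"(?:\s*\+\s*)+$"


def strip_trailing_separators(obj):
    """Hypothetical stand-in for clean_unique_id (illustration only)."""
    if isinstance(obj, str):
        return re.sub(TRAILING_SEPARATORS, "", obj)
    cleaned = obj.copy()  # work on a copy so the caller's DataFrame is untouched
    cleaned["unique_id"] = cleaned["unique_id"].str.replace(
        TRAILING_SEPARATORS, "", regex=True
    )
    return cleaned


raw = pd.DataFrame({"unique_id": ["mpr2 + + ", "compass + big"]})
tidy = strip_trailing_separators(raw)
```

Separators between two non-empty components are left alone, because the pattern is anchored to the end of the string.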

forecast_evaluation.utils.covid_filter(df: DataFrame) → DataFrame

Filter data to exclude COVID-affected periods.

For ‘gdpkp’ variable: excludes dates from 2020-01-01 to 2022-03-31 unless forecast vintage is from 2022-01-01 onwards. For other variables: removes all 2020 and 2021 dates for pre-2020Q4 vintages.

Parameters: df (pd.DataFrame) – DataFrame containing forecast data with ‘variable’, ‘date’, and ‘vintage_date_forecast’ columns
Returns: Filtered DataFrame with COVID periods removed based on variable type
Return type: pd.DataFrame
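One possible reading of the two rules above, written as boolean masks. This is a sketch under stated assumptions rather than the library's code: the name `covid_filter_sketch` is hypothetical, the date bounds are treated as inclusive, and "pre-2020Q4 vintages" is taken to mean a vintage date before 2020-10-01.

```python
import pandas as pd


def covid_filter_sketch(df):
    """Hypothetical re-implementation of the documented rules (illustration only)."""
    date = pd.to_datetime(df["date"])
    vintage = pd.to_datetime(df["vintage_date_forecast"])
    is_gdp = df["variable"] == "gdpkp"

    # Rule 1 ('gdpkp'): drop 2020-01-01..2022-03-31 unless the vintage is 2022-01-01 or later.
    in_gdp_window = (date >= pd.Timestamp("2020-01-01")) & (date <= pd.Timestamp("2022-03-31"))
    drop_gdp = is_gdp & in_gdp_window & (vintage < pd.Timestamp("2022-01-01"))

    # Rule 2 (other variables): drop 2020 and 2021 dates for vintages before 2020Q4 (assumed cutoff).
    drop_other = ~is_gdp & date.dt.year.isin([2020, 2021]) & (vintage < pd.Timestamp("2020-10-01"))

    return df[~(drop_gdp | drop_other)]


sample = pd.DataFrame(
    {
        "variable": ["gdpkp", "gdpkp", "cpisa", "cpisa", "cpisa"],
        "date": ["2020-06-30", "2020-06-30", "2021-03-31", "2021-03-31", "2022-03-31"],
        "vintage_date_forecast": ["2019-11-01", "2022-02-01", "2020-02-01", "2020-11-01", "2020-02-01"],
    }
)
filtered = covid_filter_sketch(sample)
```

In the sample, the first and third rows are removed: each pairs a COVID-period date with a vintage that predates the relevant cutoff.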

forecast_evaluation.utils.ensure_consistent_date_range(df: DataFrame) → DataFrame

Filter data to ensure consistent vintage_date_forecast range across all sources for each variable.

This function addresses the problem where different forecast sources may have different availability periods for the same variable. It finds the overlapping time period where ALL sources have data for each variable, ensuring fair comparison across models.

The function uses the latest start date and earliest end date across all sources for each variable, creating a “common denominator” time period.

Parameters:

df (pandas.DataFrame) – DataFrame containing forecast accuracy data with required columns:

‘variable’ : str - Variable identifier (e.g., ‘gdpkp’, ‘cpisa’, ‘unemp’)
‘source’ : str - Forecast source identifier (e.g., ‘compass conditional’, ‘mpr’)
‘vintage_date_forecast’ : datetime - Forecast vintage date

Returns:

Filtered DataFrame containing only data within the consistent date range for each variable.

Return type:

pandas.DataFrame
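The "latest start, earliest end" rule can be sketched with two grouped aggregations. The function name `common_vintage_window` is hypothetical and the bounds are assumed to be inclusive; the library's actual implementation may differ.

```python
import pandas as pd


def common_vintage_window(df):
    """Hypothetical sketch: keep vintages inside the window shared by all sources."""
    col = "vintage_date_forecast"
    per_source = df.groupby(["variable", "source"])[col]

    # Latest first vintage and earliest last vintage across sources, per variable.
    start = per_source.min().groupby(level="variable").max()
    end = per_source.max().groupby(level="variable").min()

    # Look up each row's window by its variable, then keep rows inside it.
    lower = df["variable"].map(start)
    upper = df["variable"].map(end)
    return df[(df[col] >= lower) & (df[col] <= upper)]


vintages_mpr = ["2019-01-01", "2019-04-01", "2019-07-01", "2019-10-01"]
vintages_compass = ["2019-04-01", "2019-07-01", "2019-10-01", "2020-01-01"]
accuracy = pd.DataFrame(
    {
        "variable": "gdpkp",
        "source": ["mpr"] * 4 + ["compass conditional"] * 4,
        "vintage_date_forecast": pd.to_datetime(vintages_mpr + vintages_compass),
    }
)
aligned = common_vintage_window(accuracy)
```

Here the shared window runs from 2019-04-01 to 2019-10-01, so each source keeps three vintages. If the sources for a variable do not overlap at all, the start falls after the end and no rows survive for that variable.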

forecast_evaluation.utils.filter_k(df: DataFrame, k: int = 12, fill_k: bool = True) → DataFrame

Filter the dataset for a particular k, replacing unreleased outturn vintages with the latest vintage.

Parameters:

df (pd.DataFrame) – DataFrame containing forecast data with ‘k’, ‘latest_vintage’, and ‘vintage_date_outturn’ columns
k (int, default=12) – Number of revisions to filter by
fill_k (bool, default=True) – If True, substitutes unreleased outturns from the latest vintage

Returns:

Filtered DataFrame containing only rows where k matches or latest vintage is used

Return type:

pd.DataFrame

forecast_evaluation.utils.filter_sources(df: DataFrame, sources: list[str]) → DataFrame

Filter the dataset based on source identifiers.

Parameters:

df (pd.DataFrame) – DataFrame containing forecast data with ‘unique_id’ column
sources (list[str]) – List of source identifiers to filter by

Returns:

Filtered DataFrame containing only rows matching the specified sources

Return type:

pd.DataFrame

forecast_evaluation.utils.find_ids_to_exclude(df: DataFrame, sources: list[str]) → list[str]

Find the identifier parts to exclude for a given selection of sources.

Source filtering works by exclusion, which should be the most efficient approach. Suppose id1 takes the values [“A”, “B”] and id2 takes the values [“big”, “small”]. If a user selects sources = [“A”], every unique_id containing “B” should be excluded. If a user selects sources = [“C”], every unique_id containing “A” or “B” should be excluded. The first step is therefore to find the ids to exclude.

Parameters:

df (pd.DataFrame) – DataFrame containing a ‘unique_id’ column with concatenated identifiers separated by ‘+’.
sources (list[str]) – List of source identifiers to check against.

Returns:

List of identifier parts to exclude

Return type:

list[str]
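The A/B example above can be reproduced with a short sketch. It rests on assumptions that are not confirmed by this page: that sources are matched against the first component of unique_id only, and that components are joined with " + ". The function names are hypothetical.

```python
import pandas as pd

SEP = " + "  # assumed separator between unique_id components


def first_part(unique_id):
    """Return the first component of a concatenated unique_id."""
    return unique_id.split(SEP)[0].strip()


def ids_to_exclude_sketch(df, sources):
    """Hypothetical sketch: first components present in df but not selected."""
    selected = set(sources)
    present = df["unique_id"].map(first_part).unique()
    return sorted(part for part in present if part not in selected)


def filter_sources_sketch(df, sources):
    """Hypothetical sketch: drop rows whose first component is excluded."""
    excluded = ids_to_exclude_sketch(df, sources)
    return df[~df["unique_id"].map(first_part).isin(excluded)]


forecasts = pd.DataFrame({"unique_id": ["A + big", "A + small", "B + big", "B + small"]})
only_a = filter_sources_sketch(forecasts, ["A"])
nothing = filter_sources_sketch(forecasts, ["C"])
```

Selecting "A" excludes "B" and keeps both "A" rows; selecting "C", which matches nothing, excludes both "A" and "B" and leaves an empty result.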

forecast_evaluation.utils.flatten_col_name(obj: DataFrame | Series) → DataFrame | Series

Convert tuple column names to strings.

Parameters: obj (pd.DataFrame or pd.Series) – DataFrame or Series with potentially multi-level column names
Returns: DataFrame or Series with flattened column names
Return type: pd.DataFrame or pd.Series
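Tuple column names typically appear after a grouped aggregation or a pivot. The sketch below shows one way to flatten them; the name `flatten_names_sketch` and the "_" joiner are assumptions, since this page does not say how the library joins the tuple parts.

```python
import pandas as pd


def flatten_names_sketch(obj, sep="_"):
    """Hypothetical sketch: join tuple names into plain strings."""

    def to_text(name):
        # Leave non-tuple names as they are.
        return sep.join(str(part) for part in name) if isinstance(name, tuple) else name

    out = obj.copy()
    if isinstance(out, pd.DataFrame):
        out.columns = [to_text(column) for column in out.columns]
    else:
        out.name = to_text(out.name)  # a Series carries a single name
    return out


wide = pd.DataFrame(
    [[1.0, 0.2]],
    columns=pd.MultiIndex.from_tuples([("rmse", "mean"), ("rmse", "std")]),
)
flat = flatten_names_sketch(wide)
```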

forecast_evaluation.utils.reconstruct_id_cols_from_unique_id(df: DataFrame, id_columns: list[str]) → DataFrame

Reconstruct individual identifier columns from the ‘unique_id’ column.

Parameters:

df (pd.DataFrame) – DataFrame containing a ‘unique_id’ column with concatenated identifiers.
id_columns (list of str) – List of column names to assign to the reconstructed identifier parts.

Returns:

DataFrame with reconstructed identifier columns.

Return type:

pd.DataFrame
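Reconstruction amounts to splitting unique_id on its separator and assigning the pieces by position. The sketch below assumes a " + " separator and as many components as names in id_columns; `split_unique_id` and the column name "scenario" are hypothetical.

```python
import pandas as pd


def split_unique_id(df, id_columns, sep=" + "):
    """Hypothetical sketch: rebuild one column per unique_id component."""
    out = df.copy()
    # Plain str.split treats the separator literally, so "+" is not read as a regex.
    parts = pd.DataFrame(
        [unique_id.split(sep) for unique_id in out["unique_id"]], index=out.index
    )
    for position, column in enumerate(id_columns):
        out[column] = parts[position].str.strip()
    return out


labelled = pd.DataFrame({"unique_id": ["mpr + baseline", "compass + conditional"]})
rebuilt = split_unique_id(labelled, ["source", "scenario"])
```

Python's `str.split` is used instead of the pandas `.str.split` accessor because pandas may interpret a multi-character pattern as a regular expression, in which the "+" would act as a quantifier.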
