Add data source dimension name mapping option
brynpickering committed Sep 25, 2024
1 parent 683a5b7 commit 41461c9
Showing 5 changed files with 176 additions and 21 deletions.
6 changes: 6 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,11 @@
## 0.7.0.dev5 (Unreleased)

### User-facing changes

|changed| cost expressions in math, to split out investment costs into the capital cost (`cost_investment`), annualised capital cost (`cost_investment_annualised`), fixed operation costs (`cost_operation_fixed`) and variable operation costs (`cost_operation_variable`, previously `cost_var`) (#645).

|new| dimension renaming functionality when loading from a data source, using the `map_dims` option (#680).

## 0.7.0.dev4 (2024-09-10)

### User-facing changes
54 changes: 52 additions & 2 deletions docs/creating/data_sources.md
@@ -17,6 +17,7 @@ In brief it is:
* **select**: values within dimensions that you want to select from your tabular data, discarding the rest.
* **drop**: dimensions to drop from your rows/columns, e.g., a "comment" row.
* **add_dims**: dimensions to add to the table after loading it in, with the corresponding value(s) to assign to the dimension index.
* **map_dims**: dimension names to map from those defined in the data table (e.g. `time`) to those used in the Calliope model (e.g. `timesteps`).

When we refer to "dimensions", we mean the sets over which data is indexed in the model: `nodes`, `techs`, `timesteps`, `carriers`, `costs`.
In addition, when loading from file, there is the _required_ dimension `parameters`.
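As a rough sketch (with hypothetical data, not Calliope's actual loading code), `rows` and `columns` name the index and column levels of the tabular data, which together index every value:

```python
import pandas as pd

# One index dimension (timesteps) and one column dimension (parameters).
df = pd.DataFrame(
    {"source_use_max": [15, 5]},
    index=pd.Index(["2005-01-01 12:00", "2005-01-01 13:00"], name="timesteps"),
)
df.columns.name = "parameters"

# Stacking the columns gives one value per (timesteps, parameters)
# combination, i.e. the long format the model data is indexed over.
long = df.stack()
print(list(long.index.names))
```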
@@ -391,8 +392,6 @@ Or to define the same timeseries source data for two technologies at different n
columns: [nodes, techs, parameters]
```



=== "With `add_dims`"

| | |
@@ -418,6 +417,57 @@ Or to define the same timeseries source data for two technologies at different n
parameters: source_use_max
```

## Mapping dimension names

Sometimes, data tables are prepared in a model-agnostic fashion, and it would require extra effort to follow Calliope's dimension naming conventions.
To load such tables without error, we can rename their dimensions on load using `map_dims`.

For example, if the file uses a `time` dimension, we can map it to the Calliope-compliant `timesteps` dimension:

=== "Without `map_dims`"

Data in file:

| timesteps | source_use_equals |
| ------------------: | :---------------- |
| 2005-01-01 12:00:00 | 15 |
| 2005-01-01 13:00:00 | 5 |

YAML definition to load data:

```yaml
data_sources:
pv_capacity_factor_data:
source: data_sources/pv_resource.csv
rows: timesteps
columns: parameters
add_dims:
techs: pv
```

=== "With `map_dims`"

Data in file:

| time | source_use_equals |
| ------------------: | :---------------- |
| 2005-01-01 12:00:00 | 15 |
| 2005-01-01 13:00:00 | 5 |

YAML definition to load data:

```yaml
data_sources:
pv_capacity_factor_data:
source: data_sources/pv_resource.csv
rows: timesteps
columns: parameters
add_dims:
techs: pv
map_dims:
time: timesteps
```
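
Under the hood, this mapping amounts to renaming the matching index or column level before the table is validated. A minimal pandas sketch (hypothetical data, not Calliope's internal code):

```python
import pandas as pd

# File data indexed by a model-agnostic "time" dimension.
df = pd.DataFrame(
    {"source_use_equals": [15, 5]},
    index=pd.Index(["2005-01-01 12:00:00", "2005-01-01 13:00:00"], name="time"),
)

# `map_dims: {time: timesteps}` behaves roughly like a dict-based
# `rename_axis`: only matching level names are renamed, the rest are
# left untouched, and the original frame is not modified in place.
renamed = df.rename_axis(index={"time": "timesteps"})
print(renamed.index.name)
```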

## Loading CSV files vs `pandas` dataframes

To load from CSV, set the filepath in `source` to point to your file.
11 changes: 10 additions & 1 deletion src/calliope/config/data_source_schema.yaml
@@ -57,6 +57,15 @@ properties:
Dimensions in the rows and/or columns that contain metadata and should therefore not be passed on to the loaded model dataset.
These could include comments on the source of the data, the data license, or the parameter units.
You can also drop a dimension and then reintroduce it in `add_dims`, but with different index items.
map_dims:
type: object
description: >-
Mapping from dimension names in the data table being loaded to the equivalent Calliope dimension names.
For instance, the "time" column in the data table would need to be mapped to "timesteps": `{"time": "timesteps"}`.
unevaluatedProperties:
type: string
description: Key is the dimension name in the data table, value is the Calliope dimension name it maps to.
pattern: '^[^_^\d][\w]*$'
add_dims:
description: >-
Data dimensions to add after loading in the array.
@@ -70,4 +79,4 @@ properties:
'^[^_^\d][\w]*$':
type: [string, array]
description: Keys are dimension names (must not be in `rows` or `columns`), values are index items of that dimension to add.
$ref: "#/$defs/DataSourceVals"
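
The dimension-name `pattern` used in this schema (first character may not be an underscore, caret or digit; the rest must be word characters) can be exercised in isolation; a small sketch with made-up names:

```python
import re

# The schema's dimension-name pattern.
pattern = re.compile(r"^[^_^\d][\w]*$")

for name in ["timesteps", "test_row1"]:
    assert pattern.match(name)  # accepted dimension names

for name in ["_hidden", "0dim", "bad name"]:
    assert not pattern.match(name)  # rejected by the schema
print("all checks passed")
```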
62 changes: 46 additions & 16 deletions src/calliope/preprocess/data_sources.py
Expand Up @@ -5,6 +5,7 @@
import logging
from collections.abc import Hashable
from pathlib import Path
from typing import Literal

import numpy as np
import pandas as pd
@@ -34,9 +35,10 @@ class DataSourceDict(TypedDict):
columns: NotRequired[str | list[str]]
source: str
df: NotRequired[str]
map_dims: NotRequired[dict[str, str]]
add_dims: NotRequired[dict[str, str | list[str]]]
select: dict[str, str | bool | int]
drop: Hashable | list[Hashable]
select: NotRequired[dict[str, str | bool | int]]
drop: NotRequired[Hashable | list[Hashable]]


class DataSource:
@@ -275,22 +277,28 @@ def _df_to_ds(self, df: pd.DataFrame) -> xr.Dataset:
"Data source must be a pandas DataFrame. "
"If you are providing an in-memory object, ensure it is not a pandas Series by calling the method `to_frame()`"
)
for axis, names in {"columns": self.columns, "index": self.index}.items():
if names is None:
if len(getattr(df, axis).names) != 1:
self._raise_error(f"Expected a single {axis} level in loaded data.")
df = df.squeeze(axis=axis)
else:
if len(getattr(df, axis).names) != len(names):
self._raise_error(
f"Expected {len(names)} {axis} levels in loaded data."
)
self._compare_axis_names(getattr(df, axis).names, names, axis)
df.rename_axis(inplace=True, **{axis: names})

tdf: pd.Series
axis_names: dict[Literal["columns", "index"], None | list[str]] = {
"columns": self.columns,
"index": self.index,
}
squeeze_me: dict[Literal["columns", "index"], bool] = {
"columns": self.columns is None,
"index": self.index is None,
}
for axis, names in axis_names.items():
if names is None and len(getattr(df, axis).names) != 1:
self._raise_error(f"Expected a single {axis} level in loaded data.")
elif names is not None:
df = self._rename_axes(df, axis, names)

for axis, squeeze in squeeze_me.items():
if squeeze:
df = df.squeeze(axis=axis)

if isinstance(df, pd.DataFrame):
tdf = df.stack(df.columns.names, future_stack=True).dropna()
tdf = df.stack(tuple(df.columns.names), future_stack=True).dropna()
else:
tdf = df

@@ -314,7 +322,6 @@ def _df_to_ds(self, df: pd.DataFrame) -> xr.Dataset:
tdf = pd.concat(
[tdf for _ in index_items], keys=index_items, names=[dim_name]
)

self._check_processed_tdf(tdf)
self._check_for_protected_params(tdf)

@@ -328,6 +335,29 @@ def _df_to_ds(self, df: pd.DataFrame) -> xr.Dataset:
self._log(f"Loaded arrays:\n{ds}")
return ds

def _rename_axes(
self, df: pd.DataFrame, axis: Literal["columns", "index"], names: list[str]
) -> pd.DataFrame:
"""Check and rename DataFrame index and column names according to data table definition.
Args:
df (pd.DataFrame): Loaded data table as a DataFrame.
axis (Literal[columns, index]): DataFrame axis.
names (list[str] | None): Expected dimension names along `axis`.
Returns:
pd.DataFrame: `df` with all dimensions on `axis` appropriately named.
"""
if len(getattr(df, axis).names) != len(names):
self._raise_error(f"Expected {len(names)} {axis} levels in loaded data.")
mapper = self.input.get("map_dims", {})
if mapper:
df.rename_axis(inplace=True, **{axis: mapper})
self._compare_axis_names(getattr(df, axis).names, names, axis)
df.rename_axis(inplace=True, **{axis: names})

return df

def _check_for_protected_params(self, tdf: pd.Series):
"""Raise an error if any defined parameters are in a pre-configured set of _protected_ parameters.
Expand Down
64 changes: 62 additions & 2 deletions tests/test_preprocess_data_sources.py
@@ -2,6 +2,7 @@

import pandas as pd
import pytest
import xarray as xr

import calliope
from calliope.preprocess import data_sources
@@ -359,6 +360,65 @@ def test_drop_one(self, source_obj):
)


class TestDataSourceMapDims:
@pytest.fixture(scope="class")
def multi_row_one_col_data(self, data_dir, init_config, dummy_int):
"""Fixture to create the xarray dataset from the data source, including dimension name mapping."""

def _multi_row_one_col_data(
mapping: dict, new_idx: list, new_cols: list
) -> xr.Dataset:
df = pd.DataFrame(
{"foo": {("bar1", "bar2"): 0, ("baz1", "baz2"): dummy_int}}
)
filepath = data_dir / "multi_row_one_col_file.csv"
df.rename_axis(
index=["test_row1", "test_row2"], columns=["test_col"]
).to_csv(filepath)
source_dict: data_sources.DataSourceDict = {
"source": filepath.as_posix(),
"rows": new_idx,
"columns": new_cols,
"add_dims": {"parameters": "test_param"},
"map_dims": mapping,
}
ds = data_sources.DataSource(init_config, "ds_name", source_dict)
return ds.dataset

return _multi_row_one_col_data

def test_fails_without_rename(self, dummy_int, multi_row_one_col_data):
"""Test that without dimension name mapping, the dataframe doesn't load successfully."""
with pytest.raises(calliope.exceptions.ModelError) as excinfo:
multi_row_one_col_data({}, ["foobar", "test_row2"], ["test_col"])
assert check_error_or_warning(
excinfo,
"Trying to set names for index but names in the file do no match names provided | "
"in file: ['test_row1', 'test_row2'] | defined: ['foobar', 'test_row2'].",
)

@pytest.mark.parametrize(
("mapping", "idx", "col"),
[
({"test_row1": "foobar"}, ["foobar", "test_row2"], ["test_col"]),
(
{"test_row1": "foobar", "test_col": "foobaz"},
["foobar", "test_row2"],
["foobaz"],
),
],
)
def test_rename(self, dummy_int, multi_row_one_col_data, mapping, idx, col):
"""Test that dimension name mapping propagates through from the initial dataframe to the final dataset."""
dataset = multi_row_one_col_data(mapping, idx, col)
assert not any(k in dataset.dims for k in mapping.keys())
assert all(v in dataset.dims for v in mapping.values())
assert (
dataset["test_param"].sel(foobar="baz1", test_row2="baz2").item()
== dummy_int
)


class TestDataSourceMalformed:
@pytest.fixture(scope="class")
def source_obj(self, init_config):
@@ -455,7 +515,7 @@ def test_carrier_info_dict_from_model_data_var(self, source_obj, param, expected
def test_carrier_info_dict_from_model_data_var_missing_dim(self, source_obj):
with pytest.raises(calliope.exceptions.ModelError) as excinfo:
source_obj.lookup_dict_from_param("FOO", "foobar")
check_error_or_warning(
assert check_error_or_warning(
excinfo,
"Loading FOO with missing dimension(s). Must contain `techs` and `foobar`, received: ('techs', 'carriers')",
)
@@ -609,7 +669,7 @@ def test_transmission_tech_with_nodes(self, source_obj):
with pytest.raises(calliope.exceptions.ModelError) as excinfo:
source_obj(df_dict).node_dict(tech_dict)

check_error_or_warning(
assert check_error_or_warning(
excinfo,
"Cannot define transmission technology data over the `nodes` dimension",
)
