Letting users define parameters and additional dimensions in YAML files #642

irm-codebase · 2024-07-19T15:41:26Z

What can be improved?

Opening this issue after discussions with @sjpfenninger and @brynpickering

Currently, the math data and schemas are a bit 'mixed'. In particular, model_def_schema.yaml contains parameter defaults, which makes looking for them difficult.

Similarly, the current parsing of yaml files is a bit difficult in where/how new dimensions and parameters are declared.

For parameters:

Declare current parameters in base.yaml instead.
Make users explicitly declare new parameters in yaml files, with at least the default value. We can just add a new params section for it.
Parameters must at least have "default" declared (with some additional logic to convert 0 to null, or warning users of it).

For dimensions:

Make users declare new dimensions in yaml files. Possible important settings are "ordered" / "numeric" for stuff like years and timesteps were the distance between them might matter.
Ditto for groups (see Easily define group constraints with custom math #604 for discussion).

The idea is to make model definition files less ambiguous. We probably should also think on how this affects schema evaluation, and some of the parsing.

Version

v0.7

The text was updated successfully, but these errors were encountered:

brynpickering · 2024-08-14T07:48:29Z

I've had a look at this and think it could work pretty well most of the time. However, there is the issue that config for parameters is then only analysed on calling calliope.Model.build. This is problematic for time resampling since one of the options is about how to resample a given parameter (taking the average or the sum).

One thing we could do is resample the data on-the-fly when calling calliope.Model.build (something I considered in the past), so the timeseries isn't resampled at instatiation. This would probably work fine but would come with the disadvantage that either results have to be in their own xarray dataset (with a different timesteps dimension) or the results are returned with a sparse timesteps dimension (e.g. every other timestep has a value with a 2h resolution).

If we continue with resampling at instatiation, there needs to be a way of defining parameter config that is separate from the math - does this get messy?

irm-codebase · 2024-08-14T09:53:45Z

I think it's better to go with the option that is (long term) easier to manage, which would be resampling during build.

Functionality-wise, it makes sense: you are telling the model to build/prepare itself. If need be, we can split backend steps and just wrap around them:

Model init
Model build
a. build.math
b. build.timeseries
c. build.whatever
Model solve...

For now, let us assume that we control the build process (i.e., model.build will run everything).
If we want users to be able to build specifics, we'll need an attribute that lets us know the state machine's status.

brynpickering · 2024-08-14T11:02:26Z

I agree that it is easier to manage on our side, but how about the data structures that come out of the other end? If you have hourly input data and then resample to monthly data, you'll get a timeseries of 8760 elements on the inputs and only 12 elements on the outputs... Should resampled inputs be available somewhere? If yes, then we risk bloating the model if we resample to close to the input resolution (e.g. 2hrs). If no, visualising input and output timeseries data will be a pain.

irm-codebase · 2024-08-14T11:48:16Z

That's where the separation of post-processing (#638) and non-conflicting configurations (#626) come in!

If done right, the selected "mode" might activate some post-processing step that makes data cleaner (in the case of re-sampling, you can activate/deactivate a "de-sampler"?).
Although to be honest I do not see this as an issue in the case of re-sampling... you request monthly data, you get monthly data. Otherwise we'd bloat the model output unnecessarily...

Also, post-processing stuff should only support "official" modes. If users define their own math, is up to them to post-treat it (like any other piece of software, really).

brynpickering · 2024-08-14T12:52:05Z

I think this is separate to either of those. It's instead about the storage of data indexed over the time dimension. We would either need to split the inputs and results into two separate xarray datasets with different length timesteps or have them in one dataset with lots of empty data. It's perhaps more linked to #634.

If two separate datasets, on saving to file we'd merge those two and still end up with sparse data in the time dimension.

irm-codebase · 2024-08-14T14:41:52Z

@brynpickering just to confirm:
sparse data in this case would mean that xarray will not fill in those spaces with nan?

Because otherwise we have an issue in our hands, since nan is not sparse.

brynpickering · 2024-08-16T11:39:34Z

It will fill in those data with NaN. I mean sparse in terms of it being a sparse matrix (more empty than non-empty entries) with NaN being used to describe empty entries.

irm-codebase · 2024-08-16T12:24:56Z

In that case, I would avoid doing this unless we guarantee data sparsity through sparse (or something similar), because each of those nan values will take 64 bits (the same as a float). Given the size of our models, this will result in very large datasets that are mostly empty.

Keeping inputs and results separate is better since it saves us from significant bloat, data wise, I think...

brynpickering · 2024-09-26T08:47:17Z

The size of our multi-dimensional arrays is a fraction of the memory required to solve a model. So I really wouldn't worry about very large, mostly empty datasets. For a recent test I ran, we're talking ~240MB of input data in memory for 25-60GB of peak memory use (depending on the backend) to optimise.

irm-codebase · 2024-09-30T14:22:01Z

Hmm... You are right. We should go with the solution that is most understandable to users...
I'd say that is loading and resampling data on the same step, to avoid this situation.

To summarize:

The current implementation loads and re-samples all files (including timeseries) during init, but loads math during build. This gets messy if params are part of the stuff we have to parse.
The new approaches could be either of these
- Load parameters and data_tables at the same time during build (users can no longer check loaded data before build, rendering init kind of pointless...)
- Split the reading of files into two stages: do params, data_tables and re-sampling during init, and the rest during build (variables, expressions, etc).

The second might be better from a user perspective, but it requires care to not tangle stuff too much (hopefully the math rework aids in keeping it relatively clean?).

brynpickering · 2024-09-30T21:54:23Z

The second option is more like:

Split the reading of files into two stages: do loading of math dict, data_tables and re-sampling during init, and the rest during build (converting variables, expressions, etc. into backend objects).

As in, the user needs all their own math content to be ready to go for init. This would then need to be stored as an intermediate CalliopeMath object. The final math passed to the backend is collated at build based on any ad-hoc math dict provided as a kwarg and whether to include the mode-specific "base" math.

The advantage of this is that you can define your own math under a math key in the model definition, rather than defining math files as a list of file references at the build stage. It's an advantage as it is more in-line with how everything else is defined in YAML.

A third option is to store the dims and parameters definitions in a separate sub-heading of the model definition (i.e., not linked to math), so that they are more clearly defining metadata of dims and params that are loaded at init. Then we revert to a user defining all the other math components as inputs at the build stage, as it is in #639.

irm-codebase · 2024-10-01T16:22:26Z

I like the second option the most. It keeps everything neatly in one place, which is generally better for maintainability...
The third one could confuse users in my opinion.

Regarding software, if we have everything under math: and don't want to change things too much, would this be what you have in mind?

CalliopeMath.data now holds additional parameters and dims dictionaries (perhaps turning into a typed dict or dataclass). This is read from math:
Model.inputs / Model._model_data is pre-filled using dims and parameters.
The backend is generated using Model._model_data.inputs and the remaining CalliopeMath.data (variables, global_expressions, constraints, piecewise_constraints, objectives).

brynpickering · 2024-10-10T13:26:57Z

This is becoming more complex the more I look into it. So, ideally we have the full set of parameter metadata loaded before we load the model data (in particular, so we can type check/coerce the incoming data, but also to know how to aggregate data on resampling). This requires knowing everything about the math, including the mode in which we plan to build it (energy_cap is a parameter in operate mode, but not in plan mode). So there is a danger that we have to shift all the math-related config to init.

This isn't ideal, as one should be able to switch between build modes at the Model.build stage.

So we're back to choosing between cleaning up the data (incl. time resampling) at init, and therefore requiring access to the full math.parameters and math.dims definitions at that point, or we don't try and do anything with the input data until build is called.

Although neither solution seems very good, I think the general idea of explicitly defining params and dims as part of the overall math definition is a good idea. @sjpfenninger perhaps you could weigh in?

irm-codebase · 2024-10-23T14:14:35Z

Hmm... I think having parameters and dims at init is a base requirement of this change anyhow...

Regarding the mode, perhaps it is not as big of a problem if we separate the current mode from the initialisation mode. This would enable us to do something like initialise in plan, obtain a solution, change the mode to operate, and then do more stuff.

How about this:

model.mode holds the current mode. During initialisation, config.init.mode is passed to this variable, and is not used again. This allows us to re-sample as needed, and config.init.mode is kept so we can track how a model was initialised.
We provide an interface to change the mode, so users can make this change explicitly. This can be saved/read via the .nc file in a similar way to how the new math class works.

Do you think this would work?
If so, we should perhaps consider taking a similar approach to other Model values that affect behaviour and that might deviate from config.init during runtime (e.g., something like model.status.mode?) .

brynpickering · 2024-11-08T17:31:05Z

I've come back to think about this again and think we might get away without doing anything with the input data until we run build. That is, we go from this:

To this:

So, we do create an intermediate "inputs" array in which the parameter and dimension definitions are applied to the initial inputs. Here, we can verify data types, attach default data, and resample dimensions (if necessary). We can do this on a filtered view of the inputs since we can just keep those parameters defined in the math "parameters".

This leads to potentially results and inputs having different length timeseries (and nodes if we allow spatial resampling in future??). So, we keep those two datasets completely separate. We can even go so far as to keep them separate in the NetCDF (/Zarr ZIP file in future) using groups. We could then have a method to spit out the intermediate input dataset as a separate xarray dataset (for debugging?).

Thoughts @irm-codebase @sjpfenninger ?

irm-codebase added enhancement v0.7 (upcoming) version 0.7 labels Jul 19, 2024

irm-codebase changed the title ~~Defining parameters and additional dimensions in model.yaml~~ Letting users define parameters and additional dimensions in YAML files Jul 20, 2024

irm-codebase mentioned this issue Jul 20, 2024

Default parameter duplication and separation of "lookups" #637

Open

3 tasks

brynpickering mentioned this issue Aug 7, 2024

Fixes in docs and new data checks #644

Merged

4 tasks

irm-codebase added this to the 0.7.0 milestone Aug 9, 2024

irm-codebase assigned irm-codebase and brynpickering and unassigned irm-codebase Aug 9, 2024

brynpickering mentioned this issue Sep 26, 2024

Use of native sparse array support in xarray / pandas / netCDF #634

Closed

This was referenced Nov 13, 2024

Add support for descriptive/passthrough data within the YAML schema #709

Open

Add param & dim math config #712

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Letting users define parameters and additional dimensions in YAML files #642

Letting users define parameters and additional dimensions in YAML files #642

irm-codebase commented Jul 19, 2024 •

edited

Loading

brynpickering commented Aug 14, 2024

irm-codebase commented Aug 14, 2024 •

edited

Loading

brynpickering commented Aug 14, 2024

irm-codebase commented Aug 14, 2024 •

edited

Loading

brynpickering commented Aug 14, 2024

irm-codebase commented Aug 14, 2024 •

edited

Loading

brynpickering commented Aug 16, 2024

irm-codebase commented Aug 16, 2024 •

edited

Loading

brynpickering commented Sep 26, 2024

irm-codebase commented Sep 30, 2024

brynpickering commented Sep 30, 2024 •

edited

Loading

irm-codebase commented Oct 1, 2024

brynpickering commented Oct 10, 2024

irm-codebase commented Oct 23, 2024

brynpickering commented Nov 8, 2024

Letting users define parameters and additional dimensions in YAML files #642

Letting users define parameters and additional dimensions in YAML files #642

Comments

irm-codebase commented Jul 19, 2024 • edited Loading

What can be improved?

Version

brynpickering commented Aug 14, 2024

irm-codebase commented Aug 14, 2024 • edited Loading

brynpickering commented Aug 14, 2024

irm-codebase commented Aug 14, 2024 • edited Loading

brynpickering commented Aug 14, 2024

irm-codebase commented Aug 14, 2024 • edited Loading

brynpickering commented Aug 16, 2024

irm-codebase commented Aug 16, 2024 • edited Loading

brynpickering commented Sep 26, 2024

irm-codebase commented Sep 30, 2024

brynpickering commented Sep 30, 2024 • edited Loading

irm-codebase commented Oct 1, 2024

brynpickering commented Oct 10, 2024

irm-codebase commented Oct 23, 2024

brynpickering commented Nov 8, 2024

irm-codebase commented Jul 19, 2024 •

edited

Loading

irm-codebase commented Aug 14, 2024 •

edited

Loading

irm-codebase commented Aug 14, 2024 •

edited

Loading

irm-codebase commented Aug 14, 2024 •

edited

Loading

irm-codebase commented Aug 16, 2024 •

edited

Loading

brynpickering commented Sep 30, 2024 •

edited

Loading