
Isolate postprocessing data creation methods #638

Open
irm-codebase opened this issue Jul 16, 2024 · 10 comments
@irm-codebase
Contributor

irm-codebase commented Jul 16, 2024

What can be improved?

At the moment, model.solve automatically runs several postprocessing methods.

However, this is a side effect that may not always be desired. It would be much more valuable to let users run these steps on request instead of forcing them.

I recommend the following:

  • Doing some slight re-work to postprocessing so that it's a module users can call and pass Model objects to.
  • Removing the use of "optional" postprocessing methods in model.py.
  • Moving "essential" postprocessing methods (clean_results?) into model.py.

This should result in leaner code and more value for users. Let me know what you think!
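To make the proposal concrete, here is a minimal sketch of what an isolated postprocessing module could look like: each function takes a solved model object and returns new data without mutating it. `Model` and `capacity_factor` are illustrative names only, not Calliope's real API, and real results would be xarray datasets rather than dicts.

```python
# Hypothetical sketch: postprocessing as a standalone module whose functions
# take a model-like object and return derived data, never mutating the model.

class Model:
    """Minimal stand-in for a solved model holding a results mapping."""
    def __init__(self, results):
        self.results = results

def capacity_factor(model):
    """Derived quantity computed from results; the model is left untouched."""
    return {
        tech: model.results["flow_out"][tech] / model.results["flow_cap"][tech]
        for tech in model.results["flow_cap"]
    }

model = Model({"flow_out": {"pv": 50.0}, "flow_cap": {"pv": 100.0}})
print(capacity_factor(model))  # {'pv': 0.5}
```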

Version

v0.7

@irm-codebase irm-codebase added enhancement v0.7 (upcoming) version 0.7 labels Jul 16, 2024
@brynpickering
Member

Sounds good. We just need a way to trigger postprocessing steps when running calliope from the command line.

@irm-codebase irm-codebase self-assigned this Jul 19, 2024
@brynpickering
Member

I've been thinking about adding a model.postprocess(...) step, which would be linked to a config.postprocess config level.

It would allow the results to be cleaned/expanded in whatever way the user has configured. An open question is whether this should return a post-processed result array or update the results in place. If updating in place, the original results would still be available in an interactive session (they are extracted from optimisation backend objects, which we could just do again). We have had problems in the past with returning a result array that is independent of a Calliope object; namely, it makes saving the file with relevant metadata a pain.
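The two options can be sketched side by side. This is illustrative code only (plain dicts stand in for xarray datasets); the small-number cleanup is just one example of a postprocessing step.

```python
# Illustrative sketch (not real Calliope code) of the two options discussed:
# returning a cleaned copy of the results vs. updating them in place.

def postprocess_copy(results, threshold=1e-10):
    """Option 1: return a cleaned copy; the original results are preserved."""
    return {k: (0.0 if abs(v) < threshold else v) for k, v in results.items()}

def postprocess_inplace(results, threshold=1e-10):
    """Option 2: mutate in place; the originals would have to be
    re-extracted from the optimisation backend, as noted above."""
    for k, v in results.items():
        if abs(v) < threshold:
            results[k] = 0.0

results = {"flow_out": 1e-12, "cost": 5.0}
cleaned = postprocess_copy(results)
print(cleaned["flow_out"], results["flow_out"])  # 0.0 1e-12
```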

@irm-codebase
Contributor Author

irm-codebase commented Jul 29, 2024

Hmm... could you elaborate on what those saving issues were, @brynpickering?
In my mind, modifying the output is not a good idea. It could easily spiral out of control in terms of checks and edge cases...
We should ensure that whatever we save to the netCDF / zarr file is only what is necessary to rebuild the same calliope model (config, math, inputs, results?).

If we make sure that postprocessing is only for data transformations, then returning arrays which can be saved seems best to me...
No postprocessing method should impact constraints or be directly relevant to the optimisation problem.
If that is the case, then they are not really post-processing in the way I thought for this issue...

For example, calculating capacity factors does not really alter data. It just depends on it.

In my mind, these steps should always return the same postprocess result:

  • init -> build -> solve -> post-process "thing"
  • init -> build -> save -> open -> build -> solve -> post-process "thing"
  • init -> build-> solve -> save -> open -> post-process "thing"
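The invariant behind the list above is that postprocessing should be a pure function of the saved data, so any save/load round-trip yields the same output. A minimal sketch, with a hypothetical `lcoe` metric and a JSON round-trip standing in for saving and reopening a netCDF file:

```python
# Sketch of the invariant: postprocessing depends only on saved data, so the
# result is identical whether or not a save/open happens in between.
import json

def lcoe(dataset):
    """Hypothetical derived metric: total cost per unit of energy produced."""
    return dataset["total_cost"] / dataset["total_flow_out"]

solved = {"total_cost": 120.0, "total_flow_out": 60.0}

direct = lcoe(solved)                      # solve -> post-process
reloaded = json.loads(json.dumps(solved))  # stand-in for save -> open
after_roundtrip = lcoe(reloaded)           # solve -> save -> open -> post-process

print(direct == after_roundtrip)  # True
```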

@brynpickering
Member

So, the issue is that if you postprocess the results and it returns an xarray dataset, that dataset won't have the config, math, or inputs attached. You would have a dataset and could call to_netcdf(), but what it stores is just the postprocessed results and nothing else. This is particularly problematic because it's a lot of work to link that dataset back to the calliope model object. The safest way to ensure that, no matter what, you have all the metadata attached when you save is to store the postprocessed results within the calliope model object. This also means a calliope model object can call postprocessing on itself. But then you effectively have duplicated data in your object (since one postprocessing step is to clean the results data to remove very small numbers, for instance), which seems unnecessary.

If we kept postprocessing to just generating auxiliary variables (LCOE etc.) then a user would probably still want to save that data along with the rest of their model, for completeness, which requires that data being available within the calliope model object.

@irm-codebase
Contributor Author

Seems like a "pick your poison" situation... but I think we should go for a solution that is pragmatic and does not over-complicate the software. I'm pro trusting users on the best way to save stuff (and lending them a hand if they don't).

Holding all the post-processed data within the model object does not feel right to me because:

  • It sounds like a combinatorial explosion antipattern.
  • Potential data duplication is a dangerous issue given the size of our models.
  • It disincentivizes "piping" approaches: users will most likely solve and post-process in the same run, instead of solving, saving, and maybe running some post-processing later if need be.

How about following these requirements?

  • Post-processing produces netCDFs with model.name, model.scenario, and timestamps added as metadata. This way, users can trace which model produced the data if need be, since the model holds the same metadata.
  • "Standard" post-processing that should always be run (like removing small numbers) is instead moved to solve to ensure consistency.
  • Let users call any post-processing they need (to make the init, build, solve process as lean as possible).

If you do not like this idea, we can go with the internal approach, but we have to be careful on how we go about it.
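The first requirement above could look something like this: derived data is bundled with provenance metadata before saving. A dict stands in for a netCDF file's attributes here, and all names (`package_postprocess`, the scenario name) are illustrative, not proposed API.

```python
# Sketch: postprocessed output carries provenance metadata (model name,
# scenario, timestamp) so it can be traced back to the model that made it.
from datetime import datetime, timezone

def package_postprocess(model_name, scenario, data_vars):
    """Attach provenance metadata to derived data before saving."""
    return {
        "attrs": {
            "model_name": model_name,
            "scenario": scenario,
            "created": datetime.now(timezone.utc).isoformat(),
        },
        "data_vars": data_vars,
    }

out = package_postprocess("national_scale", "base", {"lcoe": {"pv": 0.08}})
print(out["attrs"]["model_name"])  # national_scale
```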

@brynpickering
Member

I just think that an xarray dataset (which you would save to netCDF) that contains a few data variables from postprocessing (LCOE, capacity factors, maybe curtailment? #295) is a bit of a pain to deal with, when you would then have another netCDF with all the other information (inputs, direct results from the optimisation, applied math dictionary). So, by returning the result of postprocessing outside the model object, you are separating your data in a way that can easily let the files get out of sync: you save your data from postprocessing, then re-run the model with something changed but forget to postprocess the new results, so now your postprocessed data file is out of sync with your inputs + primary results data file.

Calling model.postprocess() should probably only add new data variables to model._model_data and not touch the results coming from the optimisation (so, e.g., cleaning tiny numbers should be a config.solve option that is applied before results are returned from the backend). Then you could either view that postprocessed data in model.results as additional variables or we could separate them out so that e.g. model.results_postprocessed filters the xarray dataset to show only the postprocessed variables.
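A hypothetical sketch of that suggestion: postprocess() only adds new, tagged variables to the model data and never touches optimisation results, while a results_postprocessed view filters for the tagged variables. The names mirror the discussion but are not Calliope's actual implementation, and plain dicts stand in for xarray data structures.

```python
# Sketch: postprocessing appends tagged variables to _model_data; a property
# filters them out for viewing. Optimisation results are never modified.

class Model:
    def __init__(self, model_data):
        # var name -> {"value": ..., "is_postprocessed": bool}
        self._model_data = model_data

    def postprocess(self):
        cf = (self._model_data["flow_out"]["value"]
              / self._model_data["flow_cap"]["value"])
        self._model_data["capacity_factor"] = {
            "value": cf, "is_postprocessed": True
        }

    @property
    def results_postprocessed(self):
        return {name: var["value"] for name, var in self._model_data.items()
                if var.get("is_postprocessed")}

m = Model({"flow_out": {"value": 40.0}, "flow_cap": {"value": 80.0}})
m.postprocess()
print(m.results_postprocessed)  # {'capacity_factor': 0.5}
```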

@irm-codebase
Contributor Author

I like the idea of separation. It's clearer down the line and easier to debug.

One last question, though. One of the ideas behind this issue is letting users call only the post-processing functions they desire, without extras, to avoid bloating the dataset with data they might not need.
Should we keep postprocess() as it is? Or just make it easier to call individual functions?

@brynpickering
Member

Yeah, I'd keep a config.postprocess section in which users can set which things to include in postprocessing, and which they can also update on-the-fly with kwargs when calling postprocess (e.g. model.postprocess(lcoe=True)).
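That merging of defaults, config, and call-time kwargs could be sketched as below. The default values and step names are illustrative only.

```python
# Sketch: resolve which postprocessing steps to run by layering defaults,
# the config.postprocess section, and call-time kwargs (kwargs win).

DEFAULT_POSTPROCESS = {"lcoe": False, "capacity_factor": True}

def resolve_postprocess(config_postprocess, **overrides):
    """Merge defaults, config, and kwargs; later layers take precedence."""
    active = {**DEFAULT_POSTPROCESS, **config_postprocess, **overrides}
    return sorted(name for name, enabled in active.items() if enabled)

# config disables lcoe, but the on-the-fly kwarg re-enables it
print(resolve_postprocess({"lcoe": False}, lcoe=True))
# ['capacity_factor', 'lcoe']
```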

@irm-codebase
Contributor Author

Ok then. Here's a summary of the requirements/discussion. Please let me know if I missed anything!

  • New config.postprocess section, which just activates/deactivates some functions we provide.
  • postprocess should never modify backend results (or inputs, for that matter). It's just reshaping data and running extra calculations.
  • By default, there is no post-processing (to keep solve as fast as possible).
  • postprocess functions that do affect data are either removed or taken to solve (e.g., small value cleanup).
  • postprocess results are saved to the dataset as new vars, with a flag/property that lets us find them quickly.

@sjpfenninger sjpfenninger added this to the 0.7.0 milestone Aug 6, 2024
@sjpfenninger
Member

I agree that this makes sense, though I think the default for config.postprocess should be more or less what happens right now (e.g. to calculate levelised costs) - and the user can change this to disable postprocessing or add more postprocessing as desired.
