Letting users define parameters and additional dimensions in YAML files #642
Comments
I've had a look at this and think it could work pretty well most of the time. However, there is the issue that config for parameters is then only analysed on calling `build`. One thing we could do is resample the data on-the-fly when calling `build`. If we continue with resampling at instantiation, there needs to be a way of defining parameter config that is separate from the math - does this get messy?
I think it's better to go with the option that is (long term) easier to manage, which would be resampling during build. Functionality-wise, it makes sense: you are telling the model to build/prepare itself. If need be, we can split backend steps and just wrap around them.
For now, let us assume that we control the build process (i.e., ...).
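To make the idea of wrapping the backend steps concrete, here is a minimal sketch, assuming a hypothetical `prepare_backend_inputs` helper and a `timesteps` datetime coordinate (none of this is the actual Calliope API):

```python
# Hypothetical sketch only - the helper name, `resample_to` argument and the
# aggregation choice are assumptions, not existing Calliope code.
import xarray as xr


def prepare_backend_inputs(inputs: xr.Dataset, resample_to: str | None = None) -> xr.Dataset:
    """Return the dataset the backend will see, resampling time on-the-fly if requested."""
    if resample_to is None:
        return inputs
    # Assumes `timesteps` is a datetime coordinate; in practice the aggregation
    # method (mean, sum, first, ...) would come from each parameter's metadata.
    return inputs.resample(timesteps=resample_to).mean()
```

Called from `build`, something like this would leave the original inputs untouched while the backend only ever sees the resampled view.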
I agree that it is easier to manage on our side, but how about the data structures that come out of the other end? If you have hourly input data and then resample to monthly data, you'll get a timeseries of 8760 elements on the inputs and only 12 elements on the outputs... Should resampled inputs be available somewhere? If yes, then we risk bloating the model if we resample to something close to the input resolution (e.g. 2 hours). If no, visualising input and output timeseries data will be a pain.
That's where the separation of post-processing (#638) and non-conflicting configurations (#626) comes in! If done right, the selected "mode" might activate some post-processing step that makes the data cleaner (in the case of re-sampling, you could activate/deactivate a "de-sampler"?). Also, post-processing should only support "official" modes. If users define their own math, it is up to them to post-process it (like any other piece of software, really).
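Such a "de-sampler" could be as simple as re-indexing the coarse results back onto the original time index; a minimal sketch, assuming a `timesteps` dimension and a hypothetical `desample` helper (not an existing Calliope function):

```python
# Hypothetical post-processing "de-sampler" - names are assumptions.
import xarray as xr


def desample(results: xr.Dataset, original_timesteps) -> xr.Dataset:
    """Broadcast coarse (e.g. monthly) results onto the original (e.g. hourly) timesteps."""
    # Forward-fill so each original timestep carries the value of the period it falls in.
    return results.reindex(timesteps=original_timesteps, method="ffill")
```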
I think this is separate from either of those. It's instead about the storage of data indexed over the time dimension. We would either need to split the inputs and results into two separate xarray datasets with different-length time dimensions, or keep everything in a single dataset. If two separate datasets, on saving to file we'd merge those two and still end up with sparse data in the time dimension.
@brynpickering just to confirm: what do you mean by "sparse" here - what would xarray do with the missing entries? Because otherwise we have an issue on our hands, since ...
It will fill in those data with NaN. I mean sparse in terms of it being a sparse matrix (more empty than non-empty entries), with NaN being used to describe empty entries.
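To illustrate the kind of sparsity being discussed, here is a small self-contained example (made-up variable names, nothing Calliope-specific) merging hourly inputs with monthly results:

```python
# Toy example with made-up variables: merging hourly inputs and monthly results.
import numpy as np
import pandas as pd
import xarray as xr

hourly = pd.date_range("2005-01-01", periods=8760, freq="h")
monthly = pd.date_range("2005-01-01", periods=12, freq="MS")

inputs = xr.Dataset({"demand": ("timesteps", np.random.rand(8760))}, coords={"timesteps": hourly})
results = xr.Dataset({"flow": ("timesteps", np.random.rand(12))}, coords={"timesteps": monthly})

merged = xr.merge([inputs, results])  # outer join on `timesteps`
print(merged.sizes["timesteps"])            # 8760
print(int(merged["flow"].notnull().sum()))  # 12 -> the other 8748 entries are NaN
```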
In that case, I would avoid doing this unless we guarantee data sparsity through `sparse` (or something similar), because each of those NaN entries would still take up memory. Keeping inputs and results separate is better since it saves us from significant bloat, data-wise, I think...
The size of our multi-dimensional arrays is a fraction of the memory required to solve a model. So I really wouldn't worry about very large, mostly empty datasets. For a recent test I ran, we're talking ~240MB of input data in memory for 25-60GB of peak memory use (depending on the backend) to optimise.
Hmm... You are right. We should go with the solution that is most understandable to users... To summarize:

1. Keep the parameter config separate from the math and apply it when the model data is loaded (at initialisation).
2. Make the parameter config part of the math definition and apply it during build.

The second might be better from a user perspective, but it requires care to not tangle stuff too much (hopefully the math rework aids in keeping it relatively clean?).
The second option is more like: the user needs all their own parameters defined as part of their math. The advantage of this is that you can define your own math under a ... A third option is to store the ...
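For the sake of discussion, a math file that carries its own parameter and dimension metadata might look roughly like the sketch below; every field name here is an assumption, not a settled schema:

```python
# Sketch of option two: parameter/dimension metadata living inside the math file.
# All field names below are assumptions for discussion, not a settled schema.
import yaml

math_with_params = """
parameters:
  flow_cap_max:
    default: .inf
    type: float
    resample_method: mean  # how to aggregate this parameter when resampling time
dimensions:
  timesteps:
    dtype: datetime
variables:
  flow_cap:
    foreach: [nodes, techs]
    bounds: {min: 0, max: flow_cap_max}
"""

math = yaml.safe_load(math_with_params)
print(math["parameters"]["flow_cap_max"]["default"])  # inf
```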
I like the second option the most. It keeps everything neatly in one place, which is generally better for maintainability... Regarding software, if we have everything under ...
This is becoming more complex the more I look into it. So, ideally we have the full set of parameter metadata loaded before we load the model data (in particular, so we can type check/coerce the incoming data, but also to know how to aggregate data on resampling). This requires knowing everything about the math, including the mode in which we plan to build it.

This isn't ideal, as one should be able to switch between build modes at the build stage. So we're back to choosing between cleaning up the data (incl. time resampling) at initialisation or at build.

Although neither solution seems very good, I think the general idea of explicitly defining params and dims as part of the overall math definition is a good idea. @sjpfenninger perhaps you could weigh in?
Hmm... Regarding the mode, perhaps it is not as big of a problem if we separate the current mode from the initialisation mode. This would enable us to do something like initialise in one mode and later build in another. How about this: ...

Do you think this would work?
I've come back to think about this again and think we might get away without doing anything with the input data until we run `build`.

So, we do create an intermediate "inputs" array in which the parameter and dimension definitions are applied to the initial inputs. Here, we can verify data types, attach default data, and resample dimensions (if necessary). We can do this on a filtered view of the inputs, since we can just keep those parameters defined in the math "parameters".

This potentially leads to results and inputs having different-length timeseries (and nodes, if we allow spatial resampling in future??). So, we keep those two datasets completely separate. We can even go so far as to keep them separate in the NetCDF (/Zarr ZIP file in future) using groups. We could then have a method to spit out the intermediate input dataset as a separate xarray dataset (for debugging?).

Thoughts @irm-codebase @sjpfenninger ?
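A rough sketch of that flow, assuming hypothetical helper names, a made-up parameter metadata dict and a `timesteps` datetime coordinate (none of this is existing Calliope code):

```python
# Hypothetical sketch of the proposed flow - names and fields are assumptions.
import xarray as xr

PARAM_META = {  # would come from the math "parameters" section
    "demand": {"dtype": "float64", "default": 0.0, "resample_method": "sum"},
}


def make_backend_inputs(raw: xr.Dataset, resample_to: str | None = None) -> xr.Dataset:
    """Filtered, type-checked, default-filled (and optionally resampled) view of the inputs."""
    inputs = raw[[name for name in PARAM_META if name in raw]]  # keep only math-defined params
    for name, meta in PARAM_META.items():
        if name in inputs:
            inputs[name] = inputs[name].fillna(meta["default"]).astype(meta["dtype"])
    if resample_to is not None:
        inputs = inputs.resample(timesteps=resample_to).sum()  # method should really be per-parameter
    return inputs


def save(raw: xr.Dataset, results: xr.Dataset, path: str) -> None:
    """Keep inputs and results in separate groups of the same NetCDF file."""
    raw.to_netcdf(path, group="inputs", mode="w")
    results.to_netcdf(path, group="results", mode="a")
```

Using NetCDF groups this way means the hourly inputs and the (e.g. monthly) results never have to share a time index in the saved file.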
What can be improved?
Opening this issue after discussions with @sjpfenninger and @brynpickering
Currently, the math data and schemas are a bit 'mixed'. In particular, `model_def_schema.yaml` contains parameter defaults, which makes looking for them difficult. Similarly, the current parsing of YAML files makes it difficult to see where/how new dimensions and parameters are declared.
For parameters:

- defaults could be defined in `base.yaml` instead;
- there is no `params` section for it.

For dimensions:

- ...
The idea is to make model definition files less ambiguous. We probably should also think about how this affects schema evaluation and some of the parsing.
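Purely as an illustration of what a less ambiguous definition could look like, a model definition with an explicit `parameters` section might read as below; the field names are assumptions made for this sketch, not an agreed schema:

```python
# Sketch of an explicit `parameters` section in a model definition file;
# field names are assumptions, not the final schema.
import yaml

model_yaml = """
parameters:
  my_custom_cost:        # a user-defined parameter, declared up-front
    default: 0
    type: float
    dims: [nodes, techs]
nodes:
  region1:
    techs:
      pv:
        my_custom_cost: 12.5
"""

model_def = yaml.safe_load(model_yaml)
print(model_def["parameters"]["my_custom_cost"]["default"])  # 0
```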
Version
v0.7