Generic interface #70
Conversation
files in a path and records all variables in that file. This helps address issue #55.
Thank you for putting this together, @sherimickelson! I left some minor comments.
builders/tslice.py
look_for_unlim = [d.dimensions[dim].isunlimited() for dim in dims]
unlim = [i for i, x in enumerate(look_for_unlim) if x]
unlim_dim = dims[unlim[0]]
fileparts['time_range'] = str(d[unlim_dim][0]) + '-' + str(d[unlim_dim][-1])
It would be nice to have a set of files from different models that we could use to set up a testing framework for this. I am happy to help set this up.
I have not tested this on WRF files, but from what I remember, I think this isn't working because Time is just a coordinate and not a variable. So I'll need another way to get the time range for WRF files.
Thanks @andersy005 for looking at this draft. It would be great if we could create a set of files to test with.
@sherimickelson When you say "Time is just a coordinate," are you saying "Time is just a dimension"?
I ask because in xarray (and according to the CF conventions) a "coordinate" is a type of "variable," and a "dimension" is just a name with a size.
@kmpaul, yes, I'm sorry. It is just a dimension.
builders/tslice.py
d = nc.Dataset(filepath, 'r')
# find what the time (unlimited) dimension is
dims = list(dict(d.dimensions).keys())
look_for_unlim = [d.dimensions[dim].isunlimited() for dim in dims]
unlim = [i for i, x in enumerate(look_for_unlim) if x]
unlim_dim = dims[unlim[0]]
@andersy005 good catch. I'll look at catching this.
@andersy005 I'm using this code to include only time variant fields in the variable list. Should it include all variables instead of limiting what is included?
> Should it include all variables instead of limiting what is included?
Thinking about this a bit, I don't really know whether we should exclude time-invariant fields or not. I am going to let others chime in. Cc @kmpaul
If we include only time-variant fields in the variable list, is there any guarantee that the unlimited dimensions are always specified in the netCDF file?
- Usually, time-invariant fields are "pseudo-coordinates"...or metadata fields. But I don't think we can guarantee that is true for all data. Maybe we should conform to xarray's standard practice, which I think is to include everything in the data variables unless it can be unambiguously identified as a "coordinate".
- I don't think you can guarantee that "unlimited" dimensions will even exist. My understanding is that some data, for reasons I cannot remember right now, explicitly removes the "unlimited" dimension from all files.
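A minimal defensive sketch of what that guard might look like, assuming netCDF4 (illustrative only, not the PR's actual code): it returns None instead of failing when a file has no unlimited dimension, or when the unlimited dimension has no matching variable, as in the WRF case discussed above.

```python
import netCDF4 as nc

def get_time_range(filepath):
    """Return 'first-last' from the unlimited dimension's variable, or None."""
    with nc.Dataset(filepath, 'r') as d:
        unlim = [name for name, dim in d.dimensions.items() if dim.isunlimited()]
        # Some datasets have no unlimited dimension at all.
        if not unlim:
            return None
        unlim_dim = unlim[0]
        # WRF-style files: the time dimension exists but has no matching variable.
        if unlim_dim not in d.variables:
            return None
        return str(d[unlim_dim][0]) + '-' + str(d[unlim_dim][-1])
```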
I'll implement @kmpaul's suggestion.
@sherimickelson and I just chatted, and she asked if I could share one of my existing YAML templates and the corresponding output it was used to produce.
The process there was to use a legacy version of
Some comments from our chat that I think are worth putting in writing here:
Below is the block defining the
Look into adding a YAML validator. Possible choices:
This is looking really nice. Is there a test for this?
builders/tslice.py
filelist = get_asset_list(stream_info['glob_string'], depth=0)
stream_info.pop('glob_string')
Make use of the pop call:
glob_string = stream_info.pop('glob_string')
filelist = get_asset_list(glob_string, depth=0)
There are no tests for this yet. I just wanted to push my latest working version.
Thanks for the review, @kmpaul.
Thanks for catching this. This was left over from an earlier version.
Missed something.
Example yaml file that contains the nested globs:
This format requires the key,
Created a schema from the yaml interface that validates with Yamale.
Example yaml:
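(The following is a hypothetical reconstruction following the two-tier format proposed in this PR; the experiment name, keys, and path are illustrative only.)

```yaml
model: CESM                      # a global key/value pair added to every data source
data_sources:
  ocn_monthly:
    glob_string: /path/to/run1/ocn/hist/*.nc
    component: ocn
    frequency: month_1
```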
Validates with this schema:
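(A hypothetical Yamale schema consistent with the sketch above; it pins down only the required structure, data_sources at the top level and glob_string in each source, since the remaining keys are user-defined. Validation would likely need Yamale's strict mode relaxed so those extra keys are allowed.)

```yaml
data_sources: map(include('data_source'))
---
data_source:
  glob_string: str()
```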
The validation step will be added to the Python code in builders/tslice.py in a future version.
While adding the schema validation into the code, a couple of issues showed up. This required a modification to the example yaml as well as the schema. The new versions follow: an example yaml that defines what should be in the catalog, and the new schema used to validate it.
The code is now able to validate the input yaml against the schema with Yamale, or internally if Yamale is not available: it checks for a successful import and, if the import fails, does the validation internally.
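A rough sketch of that import-or-fall-back pattern, with made-up function and variable names (this is not the actual builders/tslice.py code):

```python
import yaml  # PyYAML, used only for the internal fallback

try:
    import yamale
    _HAVE_YAMALE = True
except ImportError:
    _HAVE_YAMALE = False

def validate_input(input_yaml, schema_yaml):
    """Validate the input yaml, preferring Yamale when it is importable."""
    if _HAVE_YAMALE:
        schema = yamale.make_schema(schema_yaml)
        data = yamale.make_data(input_yaml)
        # strict=False so user-defined key/value columns are allowed
        yamale.validate(schema, data, strict=False)
        return
    # Internal fallback: check only the structure the schema requires.
    with open(input_yaml) as f:
        info = yaml.safe_load(f)
    if 'data_sources' not in info:
        raise ValueError("input yaml must contain the key 'data_sources'")
    for name, source in info['data_sources'].items():
        if 'glob_string' not in source:
            raise ValueError(f"data source '{name}' is missing 'glob_string'")
```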
Based on discussions, I've added the ability to use netcdf file/variable attributes within the yaml file. This is related to #64. The yaml syntax for this feature is: <var> means to use the value of the global attribute 'var'.
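For illustration only, with invented key and attribute names, that syntax might be used like this:

```yaml
data_sources:
  wrf_hourly:
    glob_string: /path/to/wrf/run1/*.nc
    model: <TITLE>        # filled from the global attribute 'TITLE' in each file
    resolution: <DX>      # filled from the global attribute 'DX'
```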
Closing this as it appears to have been addressed in ncar-xdev/ecgtools#5
This version seeks to work on generic datasets, including time-slice files that contain multiple time-variant variables.
The current interface takes in a yaml file and produces the catalog information needed by intake-esm (both a csv and json file).
Here's an example of the proposed yaml interface:
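(A hypothetical sketch of such a file, following the two-tier format described next; the experiment metadata, paths, and column names are made up.)

```yaml
case: run1                        # global key/value pairs, added to every data source
institution: NCAR
data_sources:
  atm_monthly:
    glob_string: /path/to/run1/atm/hist/*.h0.*.nc
    component: atm
    frequency: month_1
  ocn_monthly:
    glob_string: /path/to/run1/ocn/hist/*.pop.h.*.nc
    component: ocn
    frequency: month_1
```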
The interface contains two tiers and follows a specific format. The first tier must contain the key data_sources. Any other keys added at this tier are treated as global key/value pairs (pandas DataFrame columns) and are added to all data sources under this experiment. The second tier must contain the key glob_string. This string should glob all files that should be included within this "chunk"; these files will contain all of the key/value pairs (DataFrame columns) defined under this data source. Just like the first tier, these keys can be whatever the user would like.

These changes will not work yet under intake-esm. We still need to commit the changes needed to get lists of variables working. This work has also been started.
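To make the described flow concrete, here is a rough, hypothetical sketch of a builder that reads such a yaml file, globs each data source, and writes the csv/json pair that intake-esm consumes. It is not the actual builders/tslice.py (which also records per-file variables and time ranges), and the esm-collection fields shown are only the minimal ones.

```python
import glob
import json

import pandas as pd
import yaml

def build_catalog(input_yaml, csv_path, json_path):
    """Turn the two-tier yaml description into a csv + json catalog pair."""
    with open(input_yaml) as f:
        info = yaml.safe_load(f)

    # Keys other than 'data_sources' become columns on every row.
    global_cols = {k: v for k, v in info.items() if k != 'data_sources'}

    rows = []
    for name, source in info['data_sources'].items():
        source_cols = {k: v for k, v in source.items() if k != 'glob_string'}
        for path in sorted(glob.glob(source['glob_string'])):
            rows.append({'path': path, **global_cols, **source_cols})

    df = pd.DataFrame(rows)
    df.to_csv(csv_path, index=False)

    # Minimal esm-collection description pointing at the csv; the real builder
    # would also describe variable columns and aggregation controls.
    description = {
        'esmcat_version': '0.1.0',
        'id': 'generic-catalog',
        'catalog_file': csv_path,
        'attributes': [{'column_name': c} for c in df.columns if c != 'path'],
        'assets': {'column_name': 'path', 'format': 'netcdf'},
    }
    with open(json_path, 'w') as f:
        json.dump(description, f, indent=2)
```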