Generic interface #70

Closed
sherimickelson wants to merge 17 commits

Conversation

sherimickelson

This version aims to work with generic datasets, including time-slice files that contain multiple time-variant variables.

The current interface takes a YAML file and produces the catalog files needed by intake-esm (a CSV file and a JSON file).

Here's an example of the proposed yaml interface:

my_experiment1:
    experiment_name: cesm_experiment_1
    member_id: '001'
    data_sources:
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/atm/hist/*.cam.h0.*
      model_name: cam
      time_freq: month_1
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/lnd/hist/*clm2.h0.*
      model_name: clm
      time_freq: month_1
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/ocn/hist/*pop.h.*
      model_name: pop
      time_freq: month_1
    - glob_string: /the/path/to/data/CESM_DATA/cesm_experiment_1/ice/hist/*cice.h.*
      model_name: cice
      time_freq: month_1

my_experiment2:
    experiment_name: mpas_experiment_1
    member_id: '001'
    data_sources:
    - glob_string: /the/path/to/data/MPAS_DATA/mpas_experiment_1/*
      model_name: mpas
      time_freq: model_step

The interface has two tiers and follows a specific format. The first tier (the experiment) must contain the key data_sources. Any other keys at this tier are treated as global key/value pairs (pandas DataFrame columns) and are applied to every data source under that experiment. The second tier (each entry under data_sources) must contain the key glob_string, a glob pattern that matches every file belonging to this "chunk". Every matched file is assigned all of the key/value pairs (pandas DataFrame columns) listed under that data source. As with the first tier, these keys can be whatever the user would like.
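
As a rough illustration of the two-tier rule above, the sketch below expands an already-parsed YAML mapping into one catalog row per matched file. It is not the code in this PR: `build_rows`, the `path` column name, and the `list_files` parameter are hypothetical names chosen for the example.

```python
import glob


def build_rows(config, list_files=glob.glob):
    """Expand a two-tier config into one row (dict) per matched file."""
    rows = []
    for experiment in config.values():
        if "data_sources" not in experiment:
            raise KeyError("each experiment must contain 'data_sources'")
        # First-tier keys other than data_sources apply to every data source.
        global_pairs = {k: v for k, v in experiment.items() if k != "data_sources"}
        for source in experiment["data_sources"]:
            if "glob_string" not in source:
                raise KeyError("each data source must contain 'glob_string'")
            # Second-tier keys apply to every file matched by this glob.
            source_pairs = {k: v for k, v in source.items() if k != "glob_string"}
            for path in sorted(list_files(source["glob_string"])):
                rows.append({**global_pairs, **source_pairs, "path": path})
    return rows


config = {
    "my_experiment2": {
        "experiment_name": "mpas_experiment_1",
        "member_id": "001",
        "data_sources": [
            {
                "glob_string": "/the/path/to/data/MPAS_DATA/mpas_experiment_1/*",
                "model_name": "mpas",
                "time_freq": "model_step",
            }
        ],
    }
}
rows = build_rows(config)  # empty list unless the path exists on this machine
```

Each row could then become one line of the CSV file, for example by passing the list to `pandas.DataFrame`.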

These changes do not yet work with intake-esm: the changes that let it handle lists of variables still need to be committed. That work has been started.

@sherimickelson
Author

This helps address issue #55.

@kmpaul kmpaul requested a review from a team April 20, 2020 20:24
Contributor

@andersy005 andersy005 left a comment

Thank you for putting this together, @sherimickelson! I left some minor comments

look_for_unlim=[d.dimensions[dim].isunlimited() for dim in dims]
unlim=[i for i, x in enumerate(look_for_unlim) if x]
unlim_dim=dims[unlim[0]]
fileparts['time_range'] = str(d[unlim_dim][0])+'-'+str(d[unlim_dim][-1])
Contributor

I tried running this code section using one of the WRF files: /glade/collections/cdg/work/cordex/raw/wrf-era-25/1979/2D/wrfout_d01_1979-03-28_00:00:00, but it fails:

[Screenshot, 2020-04-21 3:08 PM]

Do you know what's happening?

Contributor

It would be nice to have a set of files from different models that we could use to set up a testing framework for this. I am happy to help set this up.

Author

I have not tested this on WRF files, but from what I remember, I think this isn't working because Time is just a coordinate and not a variable. So I'll need another way to get the time range for WRF files.

Author

Thanks @andersy005 for looking at this draft. That would be great if we can create a set of files we can test with.

@sherimickelson When you say "Time is just a coordinate," are you saying "Time is just a dimension"?

I ask because in xarray (and according to the CF conventions) a "coordinate" is a type of "variable." And a "dimension" is just a name with a size.

Author

@kmpaul, yes, I'm sorry. It is just a dimension.
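
To make the distinction concrete: a netCDF dimension is only a name and a size, so indexing the dataset by the dimension name works only when a variable of the same name exists. Below is a minimal sketch of a guard, assuming an object shaped like `netCDF4.Dataset` (it exposes a `variables` mapping). The function name `time_range_for_dim` is hypothetical, and what to fall back on for WRF files is left open.

```python
def time_range_for_dim(dataset, dim_name):
    """Return 'first-last' for the coordinate variable of dim_name, or None.

    `dataset` is assumed to expose a `variables` mapping, as
    netCDF4.Dataset does. None means the dimension has no variable
    of the same name, so the caller needs another source for the range.
    """
    variable = dataset.variables.get(dim_name)
    if variable is None or len(variable) == 0:
        return None
    return str(variable[0]) + "-" + str(variable[-1])
```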

Comment on lines 101 to 106
d = nc.Dataset(filepath,'r')
# find what the time (unlimited) dimension is
dims = list(dict(d.dimensions).keys())
look_for_unlim=[d.dimensions[dim].isunlimited() for dim in dims]
unlim=[i for i, x in enumerate(look_for_unlim) if x]
unlim_dim=dims[unlim[0]]
Contributor

@andersy005 andersy005 Apr 21, 2020

This is going to fail when unlim is an empty list:

[Screenshot, 2020-04-21 5:21 PM]
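
One possible way to avoid the `IndexError` raised by `unlim[0]` on an empty list is to ask for the first unlimited dimension with a default. This is only a sketch: `find_unlimited_dim` is a hypothetical helper, written against an object shaped like `netCDF4.Dataset`, and the caller still has to decide what to do when it returns `None`.

```python
def find_unlimited_dim(dataset):
    """Return the name of the first unlimited dimension, or None if there is none.

    `dataset` is assumed to expose `dimensions` as a mapping of name to an
    object with an isunlimited() method, as netCDF4.Dataset does.
    """
    return next(
        (name for name, dim in dataset.dimensions.items() if dim.isunlimited()),
        None,
    )
```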

Author

@andersy005 good catch. I'll look at catching this.

Author

@andersy005 I'm using this code to include only time variant fields in the variable list. Should it include all variables instead of limiting what is included?

Contributor

> Should it include all variables instead of limiting what is included?

Thinking about this a bit, I don't really know whether we should exclude time invariant fields or not. I am going to let others chime in. Cc @kmpaul

If we include only time-variant fields in the variable list, is there any guarantee that the unlimited dimension is always defined in the netCDF file?

  1. Usually, time-invariant fields are "pseudo-coordinates"...or metadata fields. But I don't think we can guarantee that is true for all data. Maybe we should conform to xarray's standard practice, which I think is to include everything in the data variables unless it can be unambiguously identified as a "coordinate".

  2. I don't think you can guarantee that "unlimited" dimensions will even exist. My understanding is that some data, for reasons I cannot remember right now, explicitly removes the "unlimited" dimension from all files.
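
A sketch of the first point, under the assumption that "unambiguously a coordinate" means either a variable whose name matches a dimension, or a variable named in another variable's CF `coordinates` attribute. `split_variables` is a hypothetical helper written against objects shaped like `netCDF4.Dataset`. It illustrates the rule and is not xarray's implementation.

```python
def split_variables(dataset):
    """Split variable names into (coordinates, data_variables).

    `dataset` is assumed to expose `dimensions` and `variables` mappings,
    as netCDF4.Dataset does. Everything not identified as a coordinate is
    kept as a data variable, whether or not it varies in time.
    """
    coords = {name for name in dataset.variables if name in dataset.dimensions}
    for variable in dataset.variables.values():
        # CF 'coordinates' attribute: a blank-separated list of variable names.
        listed = getattr(variable, "coordinates", "")
        coords.update(n for n in listed.split() if n in dataset.variables)
    data_vars = [name for name in dataset.variables if name not in coords]
    return sorted(coords), data_vars
```

Because nothing here depends on an unlimited dimension, the same function also covers the second point.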

Author

I'll implement @kmpaul 's suggestion.

@mnlevy1981
Contributor

@sherimickelson and I just chatted, and she asked if I could share one of my existing YAML templates and the corresponding csv.gz file. Links are to versions of the file available on github:

/glade/work/mlevy/codes/cesm2-marbl/notebooks/intake-esm-collection-defs/glade-cesm2-cmip6-collection.yaml

was used to produce

/glade/work/mlevy/intake-esm-collection/csv.gz/campaign-cesm2-cmip6-timeseries.csv.gz

The process there was to use a legacy version of intake-esm to generate a netCDF file, and then

/glade/work/mlevy/codes/cesm2-marbl/notebooks/build intake collections.ipynb (it's possible that the cesm2-cmip6 portion of that notebook has been commented out on disk in the time that has passed since generating the files, but I linked an older version)


Some comments from our chat that I think are worth putting in writing here:

  1. Overall, I think this sort of generic interface is exactly what we need to be able to generate catalogs in a consistent manner
  2. It would be useful for multiple ensemble members to be able to share a glob_string. If some output is only available for some members (such as ocean BGC output in the large ensemble), intake-esm should use missing values for the members without the data so that averaging and similar operations still work
  3. I'd like a way to add columns to the csv file. E.g. with ensembles, it's useful to track the provenance of each run: which run it split off from, in what year, and so on

Below is the block defining the SSP5-8.5 ensemble generated by CESM. We include case for each ensemble member, which lets us know what to look for in urlpath (rather than repeating urlpath for each member), and we can see that 001 spun off from historical run 010 while 002 spun off from 011 -- this is very useful for making plots like cell 10 of forcing_iron_flux.ipynb:

[Image: forcing_iron_flux]

SSP5-8.5:
  locations:
    - name: glade
      loc_type: posix
      direct_access: True
      urlpath: /glade/campaign/collections/cmip/CMIP6/timeseries-cmip6
      exclude_dirs: ['*.nc_temp_.nc']
  extra_attributes:
    component_attrs:
      ocn:
        grid: POP_gx1v7
    case_members:
      - case: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.001
        ctrl_member_id: 10
        ctrl_experiment: historical
        ctrl_branch_year: 2015
      - case: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.002
        ctrl_member_id: 11
        ctrl_experiment: historical
        ctrl_branch_year: 2015

@sherimickelson
Author

Look into adding a YAML validator. Possible choices:
https://github.com/23andMe/Yamale
https://pykwalify.readthedocs.io/en/master/
Others?

@kmpaul kmpaul left a comment

This is looking really nice. Is there a test for this?

builders/tslice.py
Comment on lines 157 to 158
filelist = get_asset_list(stream_info['glob_string'], depth=0)
stream_info.pop('glob_string')

Make use of the value returned by the pop call:

Suggested change
- filelist = get_asset_list(stream_info['glob_string'], depth=0)
- stream_info.pop('glob_string')
+ glob_string = stream_info.pop('glob_string')
+ filelist = get_asset_list(glob_string, depth=0)

Author

There are no tests for this yet. I just wanted to push my latest working version.

Author

Thanks for the review @kmpaul

sherimickelson and others added 5 commits May 14, 2020 11:15
Thanks for catching this.  This was left over from an earlier version.

Co-authored-by: Kevin Paul <kpaul@ucar.edu>

@kmpaul kmpaul left a comment

Missed something.

Co-authored-by: Kevin Paul <kpaul@ucar.edu>
@sherimickelson
Author

Example yaml file that contains the nested globs:

SSP5-8.5:
    ctrl_experiment: historical
    ensemble:
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.101/*/*/*/*/*.nc
      experiment_name: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.101 
      member_id: '001'
      ctrl_member_id: 10
      ctrl_branch_year: 2015
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.102/*/*/*/*/*.nc
      experiment_name: b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.102
      member_id: '002'
      ctrl_member_id: 11
      ctrl_branch_year: 2015
    data_sources:
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/atm/proc/tseries/month_1/*.cam.h0.*
      model_name: cam
      time_freq: month_1
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/lnd/proc/tseries/month_1/*clm2.h0.*
      model_name: clm
      time_freq: month_1
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/ocn/proc/tseries/month_1/*pop.h.*
      model_name: pop
      time_freq: month_1
    - glob_string: /glade/collections/cdg/timeseries-cmip6/b.e21.BSSP585cmip6.f09_g17.CMIP6-SSP5-8.5.*/ice/proc/tseries/month_1/*cice.h.*
      model_name: cice
      time_freq: month_1

This format requires the key ensemble if an ensemble exists, and it always requires the key data_sources. Both keys expect a list of dictionaries, each of which must contain the key glob_string.
Beyond these requirements, users can add whatever key/value pairs they would like. Each key/value pair is assigned to everything below it in the data structure. For example, the key/value pair ctrl_experiment: historical in the YAML file above is assigned to all rows under SSP5-8.5.
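
The inheritance rule can be sketched as a merge of dictionaries from the outermost level inward. How a file found by a data_sources glob is paired with an ensemble member is not spelled out above, so the sketch assumes the pairing is done by matching the file path against each member's glob_string with `fnmatch`. `rows_for_experiment` is a hypothetical name, and the file list is passed in rather than globbed from disk.

```python
from fnmatch import fnmatch


def rows_for_experiment(experiment, files_by_glob):
    """Build one row per file, merging keys from outer to inner levels.

    `files_by_glob` maps each data_sources glob_string to the list of
    paths it matched (assumed to have been globbed beforehand).
    """
    reserved = ("ensemble", "data_sources")
    top = {k: v for k, v in experiment.items() if k not in reserved}
    rows = []
    for source in experiment["data_sources"]:
        source_pairs = {k: v for k, v in source.items() if k != "glob_string"}
        for path in files_by_glob.get(source["glob_string"], []):
            member_pairs = {}
            for member in experiment.get("ensemble", []):
                # Assumption: a file belongs to the member whose glob matches it.
                if fnmatch(path, member["glob_string"]):
                    member_pairs = {k: v for k, v in member.items() if k != "glob_string"}
                    break
            rows.append({**top, **member_pairs, **source_pairs, "path": path})
    return rows
```

With this ordering, a key defined at a lower level overrides the same key defined higher up, which is one reasonable reading of "assigned to anything lower in the data structure".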

@sherimickelson
Author

I created a schema for the YAML interface that validates with Yamale.
Validation can be run on the command line:

yamale -s generic_schema.yaml my_catalog.yaml

Example yaml:

experiment:
    ctrl_experiment: piControl
    ensemble:
    - glob_string: /home/user/CESM_DATA/historical.001/*/*/*.nc
      experiment_name: historical.001 
      member_id: '001'
      ctrl_branch_year: 631
    - glob_string: /home/user/CESM_DATA/historical.002/*/*/*.nc
      experiment_name: historical.002
      member_id: '002'
      ctrl_branch_year: 661
    data_sources:
    - glob_string: /home/user/CESM_DATA/historical.*/atm/hist/*.cam.h0.*
      model_name: cam
      time_freq: month_1
    - glob_string: /home/user/CESM_DATA/historical.*/lnd/hist/*clm2.h0.*
      model_name: clm
      time_freq: month_1
    - glob_string: /home/user/CESM_DATA/historical.*/ocn/hist/*pop.h.*
      model_name: pop
      time_freq: month_1
    - glob_string: /home/user/CESM_DATA/historical.*/ice/hist/*cice.h.*
      model_name: cice
      time_freq: month_1

experiment:
    experiment_name: mountain_wave
    member_id: '001'
    data_sources:
    - glob_string: /home/user/MPAS_DATA/mountain_wave/*
      model_name: mpas
      time_freq: model_step

experiment:
    experiment_name: wrf_test
    member_id: '001'
    data_sources:
    - glob_string: /home/user/WRF_DATA/wrf-era-25/1979/*/wrfout_*
      model_name: wrf
      time_freq: model_step

Validates with this schema:

experiment: 
    ensemble: list(include('ensembleL'), required=False)
    data_sources: list(include('data_sourcesL'))
---
ensembleL:
    glob_string: str()
data_sourcesL:
    glob_string: str()

The validation step will be added to the Python code in builders/tslice.py in a future version.

@sherimickelson
Author

While adding the schema validation into the code, a couple of issues showed up. This required a modification to the example yaml as well as the schema. The new versions follow:

Example yaml that defines what should be in the catalog

catalog:
- experiment:
  ctrl_experiment: piControl
  ensemble:
  - glob_string: /home/user/CESM_DATA/historical.001/*/*/*.nc
    experiment_name: historical.001 
    member_id: '001'
    ctrl_branch_year: 631
  - glob_string: /home/user/CESM_DATA/historical.002/*/*/*.nc
    experiment_name: historical.002
    member_id: '002'
    ctrl_branch_year: 661
  data_sources:
  - glob_string: /home/user/CESM_DATA/historical.*/atm/hist/*.cam.h0.*
    model_name: cam
    time_freq: month_1
  - glob_string: /home/user/CESM_DATA/historical.*/lnd/hist/*.clm2.h0.*
    model_name: clm
    time_freq: month_1
  - glob_string: /home/user/CESM_DATA/historical.*/ocn/hist/*.pop.h.*
    model_name: pop
    time_freq: month_1
  - glob_string: /home/user/CESM_DATA/historical.*/ice/hist/*.cice.h.*
    model_name: cice
    time_freq: month_1

- experiment:
  experiment_name: mountain_wave
  member_id: '001'
  data_sources:
  - glob_string: /home/user/MPAS_DATA/mountain_wave/*
    model_name: mpas
    time_freq: model_step

- experiment:
  experiment_name: wrf_test
  member_id: '001'
  data_sources:
  - glob_string: /home/user/WRF_DATA/wrf-era-25/1979/*/wrfout_*
    model_name: wrf
    time_freq: model_step

The new schema used to validate

catalog: list(include('experiment'))
---
experiment:
    ensemble: list(include('ensembleL'), required=False)
    data_sources: list(include('data_sourcesL'))
ensembleL:
    glob_string: str()
data_sourcesL:
    glob_string: str()

@sherimickelson
Author

The code can now validate the input YAML against the schema with Yamale, or internally if Yamale is not available. It checks whether Yamale imports successfully; if the import fails, it performs the validation internally.
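
A minimal sketch of that import-or-fallback pattern follows. It is not the code in builders/tslice.py: `check_catalog` and `validate` are hypothetical names, and the hand-written checks only mirror the schema shown in the previous comment (a `catalog` list whose entries need a `data_sources` list, an optional `ensemble` list, and a string `glob_string` in every entry of either list).

```python
try:
    import yamale  # optional dependency; used when installed
except ImportError:
    yamale = None


def check_catalog(data):
    """Hand-written check mirroring the schema; returns a list of error strings."""
    errors = []
    experiments = data.get("catalog") if isinstance(data, dict) else None
    if not isinstance(experiments, list):
        return ["'catalog' must be a list of experiments"]
    for i, experiment in enumerate(experiments):
        if not isinstance(experiment, dict):
            errors.append(f"catalog[{i}] must be a mapping")
            continue
        # data_sources is required, ensemble is optional.
        for key, required in (("data_sources", True), ("ensemble", False)):
            if key not in experiment:
                if required:
                    errors.append(f"catalog[{i}] is missing '{key}'")
                continue
            entries = experiment[key]
            if not isinstance(entries, list):
                errors.append(f"catalog[{i}].{key} must be a list")
                continue
            for j, entry in enumerate(entries):
                if not isinstance(entry, dict) or not isinstance(entry.get("glob_string"), str):
                    errors.append(f"catalog[{i}].{key}[{j}] needs a string 'glob_string'")
    return errors


def validate(schema_path, catalog_path, parsed_catalog):
    """Use Yamale when it imported; otherwise fall back to check_catalog."""
    if yamale is not None:
        schema = yamale.make_schema(schema_path)
        data = yamale.make_data(catalog_path)
        yamale.validate(schema, data)  # raises if validation fails
        return []
    return check_catalog(parsed_catalog)
```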

@sherimickelson
Author

Based on discussions, I've added the ability to reference netCDF file and variable attributes within the YAML file. This is related to #64.

The yaml syntax to use this feature is:

  - glob_string: /home/user/CESM_DATA/historical.*/ocn/hist/*.pop.h.*
    model_name: pop
    time_freq: <time_period_freq>
    long_name: <<long_name>>
    units: <<units>>

<var> means: use the value of the global attribute 'var'.
<<var>> means: use the value of the variable attribute 'var'.
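
One way such placeholders might be resolved is sketched below. `resolve_value` is a hypothetical helper, the attribute dictionaries stand in for values read from a netCDF file, and returning `None` for a missing attribute is an assumption of this example rather than the behaviour of the PR.

```python
import re

# Matches a whole value of the form <<name>> (variable attribute)
# or <name> (global attribute).
_PLACEHOLDER = re.compile(r"^(?:<<(?P<var>[^<>]+)>>|<(?P<glob>[^<>]+)>)$")


def resolve_value(value, global_attrs, variable_attrs):
    """Replace a placeholder with an attribute value; pass other values through.

    `global_attrs` and `variable_attrs` are plain dicts standing in for the
    attributes read from a netCDF file and from one of its variables.
    """
    if not isinstance(value, str):
        return value
    match = _PLACEHOLDER.match(value.strip())
    if match is None:
        return value  # a literal such as 'pop'
    if match.group("var") is not None:
        return variable_attrs.get(match.group("var"))
    return global_attrs.get(match.group("glob"))
```

For the entry above, `time_freq` would come from the file's global attribute `time_period_freq`, while `long_name` and `units` would be looked up per variable.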

@andersy005
Contributor

Closing this as it appears to have been addressed in ncar-xdev/ecgtools#5

@andersy005 andersy005 closed this Sep 23, 2020