This repository has been archived by the owner on Apr 30, 2021. It is now read-only.

Gen cesm catalog #5

Draft: wants to merge 14 commits into base: main

mnlevy1981
Collaborator

Added a script that generates an intake catalog for time series data generated by CESM if pyReshaper was run in post-processing.

$ ./gen_CESM_catalog.py -c /glade/p/cgd/oce/people/mlevy/cases/b.e22b05.B1850.f09_g17.timeseries_output_for_intake/
INFO (_gen_timeseries_catalog): Will catalog files in /glade/p/cgd/oce/people/mlevy/archive/b.e22b05.B1850.f09_g17.timeseries_output_for_intake
INFO (_gen_timeseries_catalog): Creating /glade/p/cgd/oce/people/mlevy/archive/b.e22b05.B1850.f09_g17.timeseries_output_for_intake/intake/cesm_catalog.csv.gz...
$ zcat /glade/p/cgd/oce/people/mlevy/archive/b.e22b05.B1850.f09_g17.timeseries_output_for_intake/intake/cesm_catalog.csv.gz | head -n 4
case,component,stream,variable,start_date,end_date,path,parent_branch_year,child_branch_year,parent_case
b.e22b05.B1850.f09_g17.timeseries_output_for_intake,atm,cam.h0,TAUBLJY,000101,000112,../atm/proc/tseries/month_1/b.e22b05.B1850.f09_g17.timeseries_output_for_intake.cam.h0.TAUBLJY.000101-000112.nc,-1,-1,-
b.e22b05.B1850.f09_g17.timeseries_output_for_intake,atm,cam.h0,num_c2SFWET,000101,000112,../atm/proc/tseries/month_1/b.e22b05.B1850.f09_g17.timeseries_output_for_intake.cam.h0.num_c2SFWET.000101-000112.nc,-1,-1,-
b.e22b05.B1850.f09_g17.timeseries_output_for_intake,atm,cam.h0,dst_c3,000101,000112,../atm/proc/tseries/month_1/b.e22b05.B1850.f09_g17.timeseries_output_for_intake.cam.h0.dst_c3.000101-000112.nc,-1,-1,-

Note that this file assumes intake-esm can handle a relative path from the csv.gz file to the netCDF data.
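As a rough illustration of what handling relative paths could look like on the reading side, here is a minimal sketch that resolves each `path` entry against the directory holding the catalog. The function name `read_catalog_with_absolute_paths` is hypothetical, and this does not describe what intake-esm actually does; it only assumes the catalog is a gzipped CSV with a `path` column, as in the output above.

```python
import csv
import gzip
import os


def read_catalog_with_absolute_paths(catalog_file):
    """Read a gzipped CSV catalog and resolve each 'path' entry.

    Assumption for this sketch: relative paths are relative to the
    directory that contains the catalog file itself.
    """
    catalog_dir = os.path.dirname(os.path.abspath(catalog_file))
    rows = []
    with gzip.open(catalog_file, mode="rt", newline="") as f:
        for row in csv.DictReader(f):
            if not os.path.isabs(row["path"]):
                # Join with the catalog directory, then collapse the ".." parts
                row["path"] = os.path.normpath(
                    os.path.join(catalog_dir, row["path"])
                )
            rows.append(row)
    return rows
```

With the catalog stored in `<archive>/intake/`, an entry such as `../atm/proc/tseries/month_1/<file>.nc` would resolve to `<archive>/atm/proc/tseries/month_1/<file>.nc`.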

I can think of several improvements this script needs, some of which might belong in this PR and others that might spawn new issue tickets.

  1. Error handling if file name doesn't fit the {casename}.{stream}.{variable}.{start_date}-{end_date}.nc template (e.g. if history files and time series are collocated, which is default behavior of CESM postprocessing)

  2. Backup plan for determining location of time series if pp_config is not available

  3. I don't think

     catalog['parent_branch_year'] = entry_cnt*[run_config['RUN_REFDATE']]
     catalog['child_branch_year'] = entry_cnt*[run_config['RUN_STARTDATE']]
    

    will work for determining branch point if a run is based off a reference case, as they aren't in run_config yet
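For item 1, one possible shape for the file name check is sketched below. It is not the code in this PR: the function name `parse_timeseries_filename` is hypothetical, and the rule that start and end dates must have the same number of digits is an assumption added here so that a monthly history file such as `{casename}.cam.h0.0001-01.nc` is not mistaken for a time series file.

```python
import re


def parse_timeseries_filename(filename, casename):
    """Split a file name that follows
    {casename}.{stream}.{variable}.{start_date}-{end_date}.nc
    into its fields, or return None if it does not fit the template.
    """
    pattern = re.compile(
        r"^" + re.escape(casename)
        + r"\.(?P<stream>.+)"        # stream may contain dots, e.g. cam.h0
        + r"\.(?P<variable>[^.]+)"   # variable names contain no dots
        + r"\.(?P<start_date>\d+)-(?P<end_date>\d+)\.nc$"
    )
    match = pattern.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    # Assumption: both dates use the same format, so their lengths agree.
    # This rejects history files like {casename}.cam.h0.0001-01.nc.
    if len(fields["start_date"]) != len(fields["end_date"]):
        return None
    return fields
```

A caller could then skip (and log) any file for which the function returns `None` instead of adding a malformed row to the catalog.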

And I'm sure additional issues will come up, but I wanted to open this PR to advertise that this script is working in tightly controlled instances.

Closes #2

Comment on lines 57 to 61
for var in ['GET_REFCASE', 'RUN_REFCASE']:
    run_config[var] = subprocess.check_output('./xmlquery --value {}'.format(var), shell=True)
DOUT_S = subprocess.check_output('./xmlquery --value DOUT_S', shell=True)
if DOUT_S == 'TRUE':
    DOUT_S_ROOT = subprocess.check_output('./xmlquery --value DOUT_S_ROOT', shell=True)
Contributor

To be safe, we may need some error handling here by wrapping these lines in a try block. What do you think?

Collaborator Author

Yeah, error handling would be great. I'll add it to the list :)

Collaborator Author

Now that we're relying on the CIME.Case object instead of running these via subprocess, the failure mode is returning None rather than the expected variable value; that's worth checking for, but it'll be an if statement instead of a try / except block.
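A minimal sketch of that kind of check, assuming only that the case object exposes a `get_value(name)` method that returns `None` when a value is unavailable. The helper `get_required_values` and the stand-in class `FakeCase` are hypothetical and exist only for illustration; they are not part of CIME or of this PR.

```python
def get_required_values(case, names):
    """Look up each name on a case-like object.

    Returns (values, missing): a dict of the values that were found and
    a list of the names whose lookup returned None.
    """
    values = {}
    missing = []
    for name in names:
        value = case.get_value(name)
        if value is None:
            # The lookup did not fail loudly, so check explicitly
            missing.append(name)
        else:
            values[name] = value
    return values, missing


class FakeCase:
    """Stand-in for a real case object, used only in this example."""

    def __init__(self, settings):
        self._settings = settings

    def get_value(self, name):
        return self._settings.get(name)


case = FakeCase({"GET_REFCASE": False, "DOUT_S": True})
values, missing = get_required_values(
    case, ["GET_REFCASE", "RUN_REFCASE", "DOUT_S"]
)
if missing:
    print("Could not determine: {}".format(", ".join(missing)))
```

Comparing against `None` rather than testing truthiness matters here, since a legitimate value such as `False` would otherwise be reported as missing.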

I didn't like having {filename}... because it looked like we were creating a
.csv.gz... file
Also, first time running with the pre-commit hooks picked up some formatting
changes (unclear why this didn't happen in the CI framework)
Only look for {CASE}*.nc instead of *.nc
This entails opening netCDF files to get the long_name attribute; the current
implementation opens one file per variable name per component, but if a
variable is spread across multiple files it is assumed that the long_name does
not change.

Also created local copies of some of the other data in the catalog to make it
easier to reference between columns (e.g. storing path locally so I don't need
to access catalog['path'][-1] to get the most recent path value)
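The "one file per variable name per component" behavior described above could be sketched roughly as follows. This is an illustration rather than the code in the commit: `collect_long_names` is a hypothetical name, and the `read_long_name` callable is a placeholder for whatever actually opens the netCDF file and reads the attribute.

```python
def collect_long_names(entries, read_long_name):
    """Look up long_name once per (component, variable) pair.

    entries: iterable of (component, variable, path) tuples.
    read_long_name: callable taking (path, variable); a placeholder for
        the code that opens the netCDF file and reads the attribute.
    """
    long_names = {}
    for component, variable, path in entries:
        key = (component, variable)
        if key not in long_names:
            # Only the first file for this variable is opened; later files
            # are assumed to carry the same long_name.
            long_names[key] = read_long_name(path, variable)
    return long_names
```

If a variable's `long_name` did change between files, this approach would silently keep the first value, which matches the assumption stated in the commit message.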
Right now, the script itself does not use the debug message level but running
with -d will add some environment information to the output.
Comment on lines 59 to 64
try:
    os.chdir(case_root)
except:
    # TODO: set up logger instead of print statements
    logger.error('{} does not exist'.format(case_root))
    sys.exit(1)
Contributor

We may want to replace this with a contextmanager approach:

import os
from contextlib import contextmanager


@contextmanager
def chdir(path):
    """
    Change working directory to `path` and restore it again
    This context manager is useful if `path` stops existing during your
    operations.
    """
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Always return to the original directory, even if the body raised
        os.chdir(old_dir)

You can then use it as follows:

with chdir(case_root):
   ... DO SOME WORK

Collaborator Author

Nice, I'll look into this tomorrow (comment is "outdated" but hasn't been resolved yet)

Still use xmlquery to get CIMEROOT, which is needed to import Case.
If we already know cimeroot, no need for xmlquery at all -- this will be useful
once this script is part of the post-processing suite, but for now we fall back
to using xmlquery to set up the path to the CIME python libraries.
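The fallback described in this commit message might look something like the sketch below. The helper names `find_cimeroot` and `add_cime_to_path` are hypothetical, and the `scripts/lib` subdirectory is an assumption about where the CIME python libraries live; it may differ between CIME versions, so it is left as a parameter.

```python
import os
import subprocess
import sys


def find_cimeroot(case_root, cimeroot=None):
    """Return cimeroot if already known, else ask xmlquery in case_root."""
    if cimeroot is None:
        # Fallback: run ./xmlquery from the case directory
        cimeroot = subprocess.check_output(
            ["./xmlquery", "--value", "CIMEROOT"], cwd=case_root
        ).decode().strip()
    return cimeroot


def add_cime_to_path(cimeroot, subdir=os.path.join("scripts", "lib")):
    """Make the CIME python libraries importable.

    Assumption: the libraries live under cimeroot/scripts/lib.
    """
    lib_dir = os.path.join(cimeroot, subdir)
    if lib_dir not in sys.path:
        sys.path.append(lib_dir)
    return lib_dir
```

Decoding and stripping the `check_output` result matters because it returns bytes with a trailing newline, which would otherwise end up in the path.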

@jedwards4b jedwards4b left a comment

Looks good, thanks.

Base automatically changed from master to main February 3, 2021 17:47
Successfully merging this pull request may close these issues.

Tool to generate intake-catalog for CESM runs
3 participants