Best option for derived variables #349

golaz · 2022-09-20T21:32:38Z

I'm just getting started with xcdat by trying to convert some of my existing scripts from CDAT to xcdat.

Let's start with a simple and common scenario: read a few variables from separate monthly time series files, calculate annual averages and then compute some derived quantities from them.

Below are the different methods I tried:

(1) Most logical with clean syntax, but doesn't seem to actually work.
(2) Seems to work, but the resulting derived variable is no longer a dataset, which could impede further downstream manipulation.
(3) Works, but only if I reverse the order of operations and compute the derived variable first, before performing annual averages.

Q: Is there a preferred way to accomplish this? Maybe even a better/cleaner method?

(Input files for my test are on Chrysalis.)

import xcdat

# Open dataset (Chrysalis)
files = '/lcrc/group/e3sm/ac.golaz/E3SMv2/v2.LR.1pctCO2_0101/post/atm/glb/ts/monthly/10yr/*.nc'
ds = xcdat.open_mfdataset(files)

# Method (1)
print("------ Method 1 ------")

FSNT = ds.temporal.group_average("FSNT", freq="year", weighted=True)
FLNT = ds.temporal.group_average("FLNT", freq="year", weighted=True)
RESTOM = FSNT - FLNT
print(type(RESTOM))
#--> <class 'xarray.core.dataset.Dataset'>
print(RESTOM)
#--> RESTOM does not seem to contain any actual data

# Method (2)
print("------ Method 2 ------")

FSNT = ds.temporal.group_average("FSNT", freq="year", weighted=True)
FLNT = ds.temporal.group_average("FLNT", freq="year", weighted=True)
RESTOM = FSNT.FSNT - FLNT.FLNT
print(type(RESTOM))
#--> <class 'xarray.core.dataarray.DataArray'>
print(RESTOM.to_masked_array()[0:5])

# Method (3): annual average can only be done at the end
print("------ Method 3 ------")

ds['RESTOM'] = ds.FSNT - ds.FLNT
ds_yearly = ds.temporal.group_average("RESTOM", freq="year", weighted=True)
print(ds_yearly.RESTOM.to_masked_array()[0:5])

pochedls · 2022-09-20T21:47:35Z · Collaborator

Some quick feedback...

In Method 1, I think you're taking the difference of two datasets (so xarray doesn't know that variable FLNT should be subtracted from FSNT). I'm surprised Method 2 yields no output.

This seems related to this discussion [@mzelinka]. Maybe someone with more xarray experience can weigh in – it seems like this will be a very common operation.

2 replies

tomvothecoder Sep 20, 2022
Maintainer

I'm surprised Method 2 yields no output.

@pochedls, method 2 does yield results, but method 1 does not due to what you describe in your comment.

Responses to these points:

(1) Most logical with clean syntax, but doesn't seem to actually work. -- This does not work because you have to specify the exact variables in the Dataset to perform the arithmetic with (related to v0.3.0 Testing Feedback #296 (reply in thread)).
(2) Seems to work, but resulting derived variable is no longer a dataset, which could impede further downstream manipulation. -- Correct, this might impede downstream manipulations.
(3) Works, but only if I reverse order of operations and compute derived variable first before performing annual averages.
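To illustrate why the first approach comes back empty, here is a minimal sketch with synthetic data (the values and the `time` coordinate are made up, not taken from the Chrysalis files). As far as I understand xarray's default behavior, arithmetic between two Datasets only pairs up data variables that share a name, so a Dataset holding only FSNT minus a Dataset holding only FLNT has nothing to combine:

```python
import numpy as np
import xarray as xr

# Synthetic stand-ins for the two averaged datasets (values are made up).
time = np.arange(3)
ds_fsnt = xr.Dataset({"FSNT": ("time", np.array([240.0, 241.0, 242.0]))},
                     coords={"time": time})
ds_flnt = xr.Dataset({"FLNT": ("time", np.array([238.0, 238.5, 239.0]))},
                     coords={"time": time})

# Dataset - Dataset: variables are matched by name, and no name is shared here,
# so the result carries no data variables.
empty = ds_fsnt - ds_flnt
print(list(empty.data_vars))

# DataArray - DataArray: the variables are named explicitly, so this works.
restom = ds_fsnt["FSNT"] - ds_flnt["FLNT"]
print(restom.values)
```

With real xcdat output the result may not look completely empty, since both datasets can share other variables, but the FSNT and FLNT fields themselves are never paired.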

Here are some alternative solutions:

Both solutions store the averaged and derived variables in an xr.Dataset object for further operations. I noted a limitation of the xcdat averaging APIs, which is that they can only operate on a single data variable at a time.

import xcdat

# Open dataset (Chrysalis)
files = "/lcrc/group/e3sm/ac.golaz/E3SMv2/v2.LR.1pctCO2_0101/post/atm/glb/ts/monthly/10yr/*.nc"
ds = xcdat.open_mfdataset(files)

# Method 1 -- Add variables to the first dataset produced after averaging.
# -- The limitation of the averaging API is that it operates on only one variable
# at a time, so we have to do some manual work in adding averaged variables to an
# "averaged" dataset.
print("------ Method 1 ------")

# 1. Generate a dataset with the average of "FSNT"
ds_avg = ds.temporal.group_average("FSNT", freq="year", weighted=True)

# 2. Add average of "FLNT" to the dataset. Notice we extract "FLNT" from the dataset.
ds_avg["FLNT"] = ds.temporal.group_average("FLNT", freq="year", weighted=True)["FLNT"]

# 3. Calculate "RESTOM"
ds_avg["RESTOM"] = ds_avg.FSNT - ds_avg.FLNT

print(ds_avg["RESTOM"].to_masked_array()[0:5])

# Method 2 -- Construct a new Dataset for averaged and derived variables.
print("------ Method 2 ------")
import xarray as xr

# 1. Construct a new Dataset.
ds_avg = xr.Dataset()

# 2. Calculate averages for "FSNT" and "FLNT", and add them to the Dataset
ds_avg["FSNT"] = ds.temporal.group_average("FSNT", freq="year", weighted=True)["FSNT"]
ds_avg["FLNT"] = ds.temporal.group_average("FLNT", freq="year", weighted=True)["FLNT"]

# 3. Calculate the derived variable for "RESTOM"
ds_avg["RESTOM"] = ds_avg.FSNT - ds_avg.FLNT

print(ds_avg["RESTOM"].to_masked_array()[0:5])
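If more than two variables are involved, the pattern in Method 2 can be wrapped in a small loop. The sketch below is only an illustration: `average_many` and `yearly_mean` are hypothetical helper names, and `yearly_mean` is an unweighted plain-xarray stand-in so the example runs without xcdat or the Chrysalis files. With xcdat, the averaging function would presumably be the `ds.temporal.group_average(...)` call shown above.

```python
import numpy as np
import pandas as pd
import xarray as xr

def average_many(ds, names, average_fn):
    """Apply a one-variable-at-a-time averaging function to several
    variables and collect the results in a single Dataset."""
    out = xr.Dataset()
    for name in names:
        # average_fn returns a Dataset; keep only the averaged variable.
        out[name] = average_fn(ds, name)[name]
    return out

def yearly_mean(ds, name):
    """Unweighted stand-in for the averaging step (assumption). With xcdat:
    ds.temporal.group_average(name, freq="year", weighted=True)"""
    return ds[[name]].groupby("time.year").mean()

# Synthetic monthly data covering two years (values are made up).
months = pd.date_range("2000-01-01", periods=24, freq="MS")
ds_demo = xr.Dataset(
    {
        "FSNT": ("time", 240.0 + np.arange(24.0)),
        "FLNT": ("time", np.full(24, 238.0)),
    },
    coords={"time": months},
)

ds_avg_demo = average_many(ds_demo, ["FSNT", "FLNT"], yearly_mean)
ds_avg_demo["RESTOM"] = ds_avg_demo["FSNT"] - ds_avg_demo["FLNT"]
print(ds_avg_demo["RESTOM"].values)
```

Passing the averaging step in as a function keeps the helper independent of which averaging API is used.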

golaz Sep 20, 2022
Author

@tomvothecoder : thanks for the suggestions, they are very helpful.

I like your second method which is quite elegant in my view: create an empty dataset, add annual averages to it and finally add the derived quantity.

golaz · 2022-09-21T16:50:55Z · Author

Based on my short experience (days) with xcdat, but also reading through #296 and https://xcdat.readthedocs.io/en/latest/api.html#overview, I think one challenge for users transitioning from CDAT to xcdat is that there is nothing truly equivalent to a "TransientVariable". To get the full functionality of xcdat, a user should always work with datasets. However, because datasets can contain multiple data variables, the syntax for manipulating variables becomes a little heavier and less intuitive (see examples above).

I wonder if it would be possible to introduce a distinction between single-variable datasets and multivariable datasets. A single-variable dataset would be more like a CDAT TransientVariable. Is it possible for xcdat to overload arithmetic operations between datasets so that they behave more intuitively when dealing with single variable datasets?

Assume FSNT and FLNT are single-variable datasets, then

RESTOM = FSNT - FLNT

would return a new single-variable dataset with the data content being the difference between FSNT and FLNT. For single-variable datasets, the internal data variable name could be generic (data), so it could be accessed as:

RESTOM.data

If this could be done, I think it would make working with xcdat significantly more intuitive (without breaking any current functionality).
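As a rough sketch of the behavior I have in mind, written as plain helper functions rather than operator overloading: `sole_var` and `subtract_single` are hypothetical names (not part of xcdat or xarray), and the data values are made up. It assumes each input really holds exactly one data variable, which may not be true for real files that also carry bounds variables.

```python
import numpy as np
import xarray as xr

def sole_var(ds):
    """Return the only data variable of a single-variable dataset."""
    names = list(ds.data_vars)
    if len(names) != 1:
        raise ValueError(f"expected exactly one data variable, found {names}")
    return ds[names[0]]

def subtract_single(a, b, name="data"):
    """Subtract two single-variable datasets and return a new
    single-variable dataset whose variable has a generic name."""
    return (sole_var(a) - sole_var(b)).to_dataset(name=name)

# Synthetic single-variable datasets (values are made up).
years = [1, 2]
FSNT_demo = xr.Dataset({"FSNT": ("year", np.array([240.0, 241.0]))},
                       coords={"year": years})
FLNT_demo = xr.Dataset({"FLNT": ("year", np.array([238.0, 239.5]))},
                       coords={"year": years})

RESTOM_demo = subtract_single(FSNT_demo, FLNT_demo)
print(RESTOM_demo["data"].values)
```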

Tagging @tomvothecoder and @pochedls.

2 replies

pochedls Sep 26, 2022
Collaborator

@golaz - I agree with your general comment(!) – it would be nice to have something closer to a transient variable.

I've never done what you suggest (overload arithmetic operations), but a little bit of research suggests that this could be possible. A complication is that xcdat does not create a class object: it uses xarray dataarrays and datasets. I think this would require using accessors to define __add__ and __sub__ methods for datasets and having some functionality to infer which variable the user cares about. This would require some investigation.

One thing to add that I do not think was discussed in #296 that is relevant here: There was a period where the xcdat API had you (optionally) specify a variable upon loading the dataset (e.g., ds=xc.open_dataset(filename, data_var=variable)). This variable would be assigned as a dataset attribute and used to determine the dataarray of interest in subsequent calculations (e.g., ds.spatial.average(), or you could optionally pass in the variable if it was not pre-specified). I think there may have been an attempt to infer the variable, too. In the end, I think there was concern that this could create more problems than it solves. This doesn't solve the RESTOM = FSNT - FLNT issue, but I thought I'd mention it since it is in the area of "can we make datasets/dataarrays more similar to transient variables."

Does having to specify the variable for xcdat functions (e.g., regrid and spatial average) bug you or do you think improving arithmetic operations is a bigger concern?

durack1 Sep 26, 2022
Collaborator

+2 from me @golaz. I agree that the cdms.TransientVariable is something to aspire toward - it's the way much/all of the CMIPx archive is built, so is a good target

golaz · 2022-09-26T22:41:32Z · Author

@pochedls : I don't have a good sense of what is possible or not, so thanks for entertaining my suggestions.

I realize that xcdat does not create its own class. In fact, it's rather cool how xcdat adds functionality directly to xarray datasets through accessors. I did not know that was possible. If accessors can also be used to define __add__ and __sub__, that would open up many additional possibilities. In that case, xcdat could introduce the concept of a single-variable dataset, which would be a sub-category of a regular multi-variable dataset. Single-variable datasets could support additional functionality, such as direct addition and subtraction, but also averaging operations without having to specify the variable name (for which the syntax does bother me a little :)).

The API for multi-variable datasets would not be impacted, and could be used as well for single-variables datasets. There would just be an additional (more intuitive) API option for single-variable datasets, something like

FSNT = FSNT.temporal.group_average(freq="year", weighted=True)
FLNT = FLNT.temporal.group_average(freq="year", weighted=True)
RESTOM = FSNT - FLNT

1 reply

pochedls Sep 27, 2022
Collaborator

I messed around with this a little bit. Before I go on... I am pretty sure this will ultimately not work because I'm violating some important principles of software development... but maybe @tomvothecoder will have a suggestion that is safe and usable in xcdat.

The following works, but it is obviously overly simplistic.

First I defined a (simple) function to infer what data variable we should use for dataset operations (I did this in the xcdat __init__.py file so this activates upon importing xcdat):

def infer_data_var(ds):
    """ placeholder method to define a method
        to infer the data_var of interest.
        This just chooses the one with the most
        elements.
    """
    dvkeep = None  # stays None if the dataset has no data_vars
    max_size = 0
    # loop over dataset data_vars
    for dv in ds.data_vars:
        dvshape = ds[dv].shape
        # count the elements in each data_var
        counter = 1
        for coordsize in dvshape:
            counter *= coordsize
        # choose the data_var with the most elements
        if counter > max_size:
            dvkeep = dv
            max_size = counter
    return dvkeep

I also defined an xcdat add operator that figures out the data_var of interest in dataset1 (self) and dataset2 (other):

def __xcadd__(self, other):
    """ function to add two datasets in an
    intuitive way.
    """
    # get data_vars of interest
    dvs = infer_data_var(self)
    dvo = infer_data_var(other)
    # get dataarrays (one from each dataset)
    x = self[dvs]
    y = other[dvo]
    # copy the left-hand dataset and store the sum under its data_var name
    ds = self.copy()
    ds[dvs] = x + y
    return ds

I then imported xarray and overrode the xr.Dataset.__add__ method:

import xarray as xr
xr.Dataset.__add__ = __xcadd__

I think this has the behavior we want (when I compute ds + ds, I get a dataset whose tas equals 2*ds.tas):

import xcdat as xc

fn = '/p/css03/esgf_publish/CMIP6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/v20190308/tas_Amon_CESM2_historical_r1i1p1f1_gn_185001-201412.nc'
ds = xc.open_dataset(fn)
ds2 = ds + ds
ds2.tas

<xarray.DataArray 'tas' (time: 1980, lat: 192, lon: 288)>
array([[[490.64417, 490.64417, 490.64417, ..., 490.64417, 490.64417,
490.64417],
[492.2119 , 492.12476, 491.8006 , ..., 492.3339 , 492.30038,
492.25146],
[493.4509 , 493.36188, 493.31247, ..., 493.97626, 493.85938,
493.6837 ],

Note that I was also able to add the following at the top of the spatial averager (and make data_var an optional argument):

        if not data_var:
            data_var = self.infer_data_var()

This allowed me to call ds.spatial.average() without specifying the data_var, which is what I would like to do if there is only one data_var worth averaging.
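A possibly safer variant of the same idea, sketched here only as an illustration: instead of replacing xr.Dataset.__add__ for every xarray user, the logic could live in a registered accessor with an explicit method. Python looks up operators such as + on the type rather than on an accessor, so an accessor cannot provide ds + ds itself, only something like ds.onevar.add(other). The accessor name `onevar` and its methods are hypothetical, the inference rule is the same "most elements" placeholder as above, and the data are synthetic.

```python
import numpy as np
import xarray as xr

@xr.register_dataset_accessor("onevar")  # hypothetical accessor name
class OneVarAccessor:
    def __init__(self, ds):
        self._ds = ds

    def infer(self):
        """Name of the data variable with the most elements (placeholder rule)."""
        if not self._ds.data_vars:
            raise ValueError("dataset has no data variables")
        return max(self._ds.data_vars, key=lambda k: self._ds[k].size)

    def add(self, other):
        """Return a copy of this dataset with the inferred variables summed."""
        mine = self.infer()
        theirs = other.onevar.infer()
        out = self._ds.copy()
        out[mine] = self._ds[mine] + other[theirs]
        return out

# Synthetic dataset: one "main" variable plus a smaller bounds-like variable.
ds_demo = xr.Dataset(
    {
        "tas": (("time", "lat"), np.full((4, 3), 245.0)),
        "time_bnds": (("time", "bnds"), np.zeros((4, 2))),
    }
)

ds2_demo = ds_demo.onevar.add(ds_demo)
print(ds2_demo["tas"].values[0])
```

The trade-off is a slightly heavier syntax than ds + ds, in exchange for leaving xarray's own Dataset arithmetic untouched.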
