Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should unspecified optional strings be the empty string or NaN? #25

Open
dilpath opened this issue Jun 18, 2020 · 4 comments
Open

Should unspecified optional strings be the empty string or NaN? #25

dilpath opened this issue Jun 18, 2020 · 4 comments
Labels
question Further information is requested

Comments

@dilpath
Copy link
Member

dilpath commented Jun 18, 2020

At the moment, there can be NaNs (after pd.read_csv) in optional PEtab string columns, such as observableNames, that, if interpreted as a string, are converted to the string literal 'nan'.

>>> import numpy as np
>>> str(np.nan)
'nan'

An issue can occur in the AMICI plotting functions. This issue can be fixed by replacing

elif model.getObservableNames()[iy] != '':

with

elif model.getObservableNames()[iy] in ['', 'nan']:

to correctly identify unspecified observable names. However, testing for the string 'nan' seems unintuitive, and this fix might cause another issue if an observable is named 'nan'.

Here's a solution, which could be implemented in PEtab, and might resolve the issue in AMICI.

$ cat test_str.csv
observableId    observableName
a_id    a_name
b_id    
>>> import pandas as pd
>>> df1 = pd.read_csv('test_str.csv', sep='\t')
>>> df2 = pd.read_csv('test_str.csv', sep='\t')
>>> df2['observableName'] = df2['observableName'].fillna('')
>>> df1
  observableId observableName
0         a_id         a_name
1         b_id            NaN
>>> df2
  observableId observableName
0         a_id         a_name
1         b_id               
@yannikschaelte
Copy link
Member

in general, all of '', 'nan' (, 'NaN', ...) should be allowed as input in the dataframe and interpreted as "Nothing". I am not sure why the observable name is read in as a NaN here, I would have thought the column is interpreted as string. But apparently not ... . The question would be where to put this conversion. At the moment, we perform little value checking when reading in the csv files in petab, so one could do it in AMICI, or here but then also for other potentially problematic columns ... preferences @dweindl @LeonardSchmiester ?

@dilpath
Copy link
Member Author

dilpath commented Jun 18, 2020

I would have thought the column is interpreted as string

The columns are correctly identified as object type, as opposed to the int64 or float64 types of other columns, but missing values are still replaced with NaN in the pandas dataframe. Specifying the column type in the read_csv command also doesn't stop the insertion of NaNs.

>>> df1.dtypes
observableId      object
observableName    object
dtype: object
>>> df3 = pd.read_csv('test_str.csv', sep='\t', dtype={'observableName': str})
>>> df3.dtypes
observableId      object
observableName    object
dtype: object
>>> df3
  observableId observableName
0         a_id         a_name
1         b_id            NaN

The question would be where to put this conversion.

One way would be to replace

PARAMETER_DF_OPTIONAL_COLS = [
    PARAMETER_NAME, NOMINAL_VALUE,
    INITIALIZATION_PRIOR_TYPE, INITIALIZATION_PRIOR_PARAMETERS,
    OBJECTIVE_PRIOR_TYPE, OBJECTIVE_PRIOR_PARAMETERS]

at https://github.com/PEtab-dev/PEtab/blob/master/petab/C.py#L78 with

PARAMETER_DF_OPTIONAL_COLS_STR = [
    PARAMETER_NAME,
    INITIALIZATION_PRIOR_TYPE, INITIALIZATION_PRIOR_PARAMETERS,
    OBJECTIVE_PRIOR_TYPE, OBJECTIVE_PRIOR_PARAMETERS]
PARAMETER_DF_OPTIONAL_COLS += [NOMINAL_VALUE]

then insert

parameter_df[PARAMETER_DF_OPTIONAL_COLS_STR] = \
    parameter_df[PARAMETER_DF_OPTIONAL_COLS_STR].fillna('')

here
https://github.com/PEtab-dev/PEtab/blob/master/petab/parameters.py#L37
and here
https://github.com/PEtab-dev/PEtab/blob/master/petab/parameters.py#L41

Similarly for observables/other tables.

@dweindl
Copy link
Member

dweindl commented Jun 19, 2020

One option would be to give explicit data types for each column in read_csv. That may also speed up loading larger files. Not sure if this works if some columns are not present. Another problem is, that some columns may be float or object, depending on the specific file (e.g. condition table with all numeric parameters vs. parameter IDs).

Otherwise, checking everything with pandas.isna instead of comparing to '' should also be okay.

@FFroehlich
Copy link
Collaborator

fixed in AMICI-dev/AMICI@6dfac95 although I really don't think this should be fixed in amici.

Ensuring that names are strings would be a good start and shouldn't be too much trouble.

@dweindl dweindl transferred this issue from PEtab-dev/PEtab Mar 18, 2021
@plakrisenko plakrisenko added the question Further information is requested label May 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

5 participants