Should unspecified optional strings be the empty string or NaN? #25

dilpath · 2020-06-18T18:12:43Z

At the moment, there can be NaNs (after pd.read_csv) in optional PEtab string columns, such as observableNames, that, if interpreted as a string, are converted to the string literal 'nan'.

>>> import numpy as np
>>> str(np.nan)
'nan'

An issue can occur in the AMICI plotting functions. This issue can be fixed by replacing

elif model.getObservableNames()[iy] != '':

with

elif model.getObservableNames()[iy] in ['', 'nan']:

to correctly identify unspecified observable names. However, testing for the string 'nan' seems unintuitive, and this fix might cause another issue if an observable is named 'nan'.

Here's a solution, which could be implemented in PEtab, and might resolve the issue in AMICI.

$ cat test_str.csv
observableId    observableName
a_id    a_name
b_id

>>> import pandas as pd
>>> df1 = pd.read_csv('test_str.csv', sep='\t')
>>> df2 = pd.read_csv('test_str.csv', sep='\t')
>>> df2['observableName'] = df2['observableName'].fillna('')
>>> df1
  observableId observableName
0         a_id         a_name
1         b_id            NaN
>>> df2
  observableId observableName
0         a_id         a_name
1         b_id

The text was updated successfully, but these errors were encountered:

yannikschaelte · 2020-06-18T18:22:57Z

in general, all of '', 'nan' (, 'NaN', ...) should be allowed as input in the dataframe and interpreted as "Nothing". I am not sure why the observable name is read in as a NaN here, I would have thought the column is interpreted as string. But apparently not ... . The question would be where to put this conversion. At the moment, we perform little value checking when reading in the csv files in petab, so one could do it in AMICI, or here but then also for other potentially problematic columns ... preferences @dweindl @LeonardSchmiester ?

dilpath · 2020-06-18T21:02:03Z

I would have thought the column is interpreted as string

The columns are correctly identified as object type, as opposed to the int64 or float64 types of other columns, but missing values are still replaced with NaN in the pandas dataframe. Specifying the column type in the read_csv command also doesn't stop the insertion of NaNs.

>>> df1.dtypes
observableId      object
observableName    object
dtype: object
>>> df3 = pd.read_csv('test_str.csv', sep='\t', dtype={'observableName': str})
>>> df3.dtypes
observableId      object
observableName    object
dtype: object
>>> df3
  observableId observableName
0         a_id         a_name
1         b_id            NaN

The question would be where to put this conversion.

One way would be to replace

PARAMETER_DF_OPTIONAL_COLS = [
    PARAMETER_NAME, NOMINAL_VALUE,
    INITIALIZATION_PRIOR_TYPE, INITIALIZATION_PRIOR_PARAMETERS,
    OBJECTIVE_PRIOR_TYPE, OBJECTIVE_PRIOR_PARAMETERS]

at https://github.com/PEtab-dev/PEtab/blob/master/petab/C.py#L78 with

PARAMETER_DF_OPTIONAL_COLS_STR = [
    PARAMETER_NAME,
    INITIALIZATION_PRIOR_TYPE, INITIALIZATION_PRIOR_PARAMETERS,
    OBJECTIVE_PRIOR_TYPE, OBJECTIVE_PRIOR_PARAMETERS]
PARAMETER_DF_OPTIONAL_COLS += [NOMINAL_VALUE]

then insert

parameter_df[PARAMETER_DF_OPTIONAL_COLS_STR] = \
    parameter_df[PARAMETER_DF_OPTIONAL_COLS_STR].fillna('')

here
https://github.com/PEtab-dev/PEtab/blob/master/petab/parameters.py#L37
and here
https://github.com/PEtab-dev/PEtab/blob/master/petab/parameters.py#L41

Similarly for observables/other tables.

dweindl · 2020-06-19T08:05:56Z

One option would be to give explicit data types for each column in read_csv. That may also speed up loading larger files. Not sure if this works if some columns are not present. Another problem is, that some columns may be float or object, depending on the specific file (e.g. condition table with all numeric parameters vs. parameter IDs).

Otherwise, checking everything with pandas.isna instead of comparing to '' should also be okay.

FFroehlich · 2020-10-13T14:23:38Z

fixed in AMICI-dev/AMICI@6dfac95 although I really don't think this should be fixed in amici.

Ensuring that names are strings would be a good start and shouldn't be too much trouble.

dweindl transferred this issue from PEtab-dev/PEtab Mar 18, 2021

plakrisenko added the question Further information is requested label May 4, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should unspecified optional strings be the empty string or NaN? #25

Should unspecified optional strings be the empty string or NaN? #25

dilpath commented Jun 18, 2020

yannikschaelte commented Jun 18, 2020

dilpath commented Jun 18, 2020

dweindl commented Jun 19, 2020

FFroehlich commented Oct 13, 2020

Should unspecified optional strings be the empty string or NaN? #25

Should unspecified optional strings be the empty string or NaN? #25

Comments

dilpath commented Jun 18, 2020

yannikschaelte commented Jun 18, 2020

dilpath commented Jun 18, 2020

dweindl commented Jun 19, 2020

FFroehlich commented Oct 13, 2020