lack of explicit separator in read_csv in /mutate/protocol.py #272

paulaberry · 2022-02-03T22:51:27Z

The function that imports the mutation data table in /mutate/protocol.py does not pass an explicit separator, so pandas expects commas as field separators. However, the function in /mutate/calculations.py that deals with that table expects commas to separate the locations of mutations. As a result, the pipeline stops with an error if you use commas as both field separators and mutation separators. If you instead format the mutation data table as suggested, with semicolons as field separators, pandas raises a KeyError.

I fixed this issue on my installation of evcouplings by changing line 126 of mutate/protocol.py from data = pd.read_csv(dataset_file, comment="#") to data = pd.read_csv(dataset_file, comment="#", sep=";").
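
To illustrate the symptom and the local fix, here is a minimal sketch that only uses pandas. The column names (mutant, effect) and the values are made up for the example and are not taken from the evcouplings code or its example file.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated table (assumed column names and values).
raw = "mutant;effect\nA12G;0.5\nK7R;1.2\n"

# Default separator (","): the whole header is read as one column name.
default_df = pd.read_csv(io.StringIO(raw), comment="#")

# Explicit separator (";"): the two columns are split as intended.
fixed_df = pd.read_csv(io.StringIO(raw), comment="#", sep=";")

# Looking up a column by name fails on the table read with the default.
try:
    default_df["mutant"]
    key_error_raised = False
except KeyError:
    key_error_raised = True

print(list(default_df.columns))
print(list(fixed_df.columns))
print(key_error_raised)
```

With the default separator the table ends up with a single column, which would explain a KeyError as soon as any code looks up a column by name.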

thomashopf · 2022-02-04T08:42:42Z

Hi @paulaberry,

In your csv file, are the fields containing commas escaped/quoted, like in the following example file?
test_mutants.csv

This way, which is also how pandas' to_csv() function saves, you can have commas in comma-separated files.
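
As a rough sketch of what such a quoted file looks like when read with the default separator (the column names and values here are invented for illustration and are not the contents of test_mutants.csv):

```python
import io
import pandas as pd

# Hypothetical comma-separated table; the field that itself contains
# a comma is wrapped in double quotes.
raw = 'mutant,effect\n"A12G,L45P",0.5\nK7R,1.2\n'

# The default sep="," is enough: quoted commas are not field separators.
df = pd.read_csv(io.StringIO(raw), comment="#")

# The comma inside the field is still available for splitting mutations.
mutations = df["mutant"].str.split(",").tolist()

print(list(df.columns))
print(mutations)
```

The quoted field comes back as a single string, so downstream code can still split it on commas.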

paulaberry · 2022-02-04T15:42:17Z

The only example I could find of how to format a mutation effects data table is this file, which is also the one referenced in your mutation effects documentation: https://github.com/debbiemarkslab/EVcouplings/blob/develop/notebooks/example/PABP_YEAST_Fields2013-singles.csv. Are two different functions used for the same calculations when using EVcouplings as a Python package versus the command line interface, such that this difference in formatting is needed?

thomashopf · 2022-02-06T10:55:14Z

Sorry, this is an unfortunate mismatch between the mutation effect documentation and overall pipeline usage. I tagged this issue to fix the documentation example so it doesn't lead to misunderstandings in the future.

In the example notebook, pd.read_csv() is called with an explicit sep=";" argument, while the pipeline defaults to sep="," when reading the csv file. The default is the intended behaviour, so that csv files are handled consistently across the pipeline; the example file dates back to an older set of files. The actual prediction functions applied afterwards are the same.

So as a solution, I propose using a file that is formatted like the test_mutants.csv file I posted above, where any strings containing commas are wrapped in quotation marks, which I think is the de facto standard for handling this case.
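
One possible way to convert an existing semicolon-separated table into that format is sketched below. It assumes a table with hypothetical columns (mutant, effect) and uses in-memory buffers in place of real file paths. It is not part of the pipeline.

```python
import io
import pandas as pd

# Hypothetical old-style table: ";" between fields, "," between mutations.
old = "mutant;effect\nA12G,L45P;0.5\nK7R;1.2\n"

# Read it once with the explicit separator...
table = pd.read_csv(io.StringIO(old), sep=";", comment="#")

# ...and write it back as a regular csv. With its default settings,
# to_csv() quotes any field that contains the separator.
buffer = io.StringIO()
table.to_csv(buffer, index=False)
converted = buffer.getvalue()

# The converted text can now be read with the default sep=",".
roundtrip = pd.read_csv(io.StringIO(converted), comment="#")

print(converted)
```

After the conversion, the multi-mutation entry is wrapped in quotation marks and the table reads back unchanged with the default separator.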

thomashopf added the enhancement label Feb 6, 2022
