InvalidIndexError when running ml code #188

Open
ntalluri opened this issue Oct 7, 2024 · 4 comments

ntalluri commented Oct 7, 2024

I encountered an InvalidIndexError when running the ml code during individual runs of parameter sweeps on mincostflow with the EGFR dataset. The error is raised in the summarize_networks function during reindexing: non-unique index values cause the pandas concatenation to fail.

# initially construct separate dataframes per algorithm
edge_dataframes = []
# the dataframe is set up per algorithm and a 1 is set for the edge pair that exists in the algorithm
for tup in edge_tuples:
    dataframe = pd.DataFrame(
        {
            str(tup[0]): 1,
        }, index=tup[1]
    )
    edge_dataframes.append(dataframe)

# concatenating all the algorithm-specific dataframes together
# (0 is set for all the edge pairs that don't exist per algorithm)
concated_df = pd.concat(edge_dataframes, axis=1, join='outer')
concated_df = concated_df.fillna(0)
concated_df = concated_df.astype('int64')

Error Trace:

RuleException:
InvalidIndexError in file /Users/nehatalluri/Desktop/research/spras/Snakefile, line 315:
Reindexing only valid with uniquely valued Index objects
  File "/Users/nehatalluri/Desktop/research/spras/Snakefile", line 315, in __rule_ml_analysis
  File "/Users/nehatalluri/Desktop/research/spras/spras/analysis/ml.py", line 85, in summarize_networks
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/util/_decorators.py", line 331, in wrapper
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 381, in concat
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/core/reshape/concat.py", line 612, in get_result
  File "/Users/nehatalluri/anaconda3/envs/spras/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 3904, in get_indexer
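
A minimal sketch of how this error can arise, for anyone who wants to reproduce it outside the Snakemake workflow. The algorithm names and the edge label format ("A|B") are made up for illustration and may not match what summarize_networks actually builds; the only point is that one dataframe has a repeated index label and the indexes differ, so the outer concat has to reindex.

```python
import pandas as pd

# Hypothetical input: (algorithm name, list of edge labels).
# The label format is an assumption, not taken from the SPRAS code.
edge_tuples = [
    ("algo_a", ["A|B", "C|D", "A|B"]),  # "A|B" is listed twice
    ("algo_b", ["A|B", "E|F"]),
]

# Same construction as in summarize_networks: one column of 1s per algorithm
edge_dataframes = [
    pd.DataFrame({str(name): 1}, index=edges) for name, edges in edge_tuples
]

try:
    pd.concat(edge_dataframes, axis=1, join="outer")
    error_message = None
except pd.errors.InvalidIndexError as exc:
    # Raised because the first dataframe's index is not unique
    error_message = str(exc)

print(error_message)
```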

ntalluri changed the title from "InvalidIndexError when running ml_analysis" to "InvalidIndexError when running ml code" on Oct 7, 2024

ntalluri commented Oct 25, 2024

The error occurs because the pathways generated by mincostflow contain duplicate edges in the pathway.txt file. This happens in many of the mincostflow output pathways, so the problem seems to lie in the mincostflow code as well as in the ml code.

Here is an example of one of the pathway.txt files with duplicate edges:

Node1 Node2 Rank Direction
EGFR_HUMAN EGF_HUMAN 1 U
S10A4_HUMAN EGF_HUMAN 1 U
HDAC6_HUMAN EGF_HUMAN 1 U
HS90A_HUMAN HDAC6_HUMAN 1 U
KS6A3_HUMAN SRC_HUMAN 1 U
SRC_HUMAN EMD_HUMAN 1 U
FYN_HUMAN KS6A3_HUMAN 1 U
CBL_HUMAN EGFR_HUMAN 1 U
MYH9_HUMAN S10A4_HUMAN 1 U
EGFR_HUMAN EGF_HUMAN 1 U
LMNA_HUMAN EGF_HUMAN 1 U
S10A4_HUMAN EGF_HUMAN 1 U
HDAC6_HUMAN EGF_HUMAN 1 U
GRB2_HUMAN EGF_HUMAN 1 U
HS90A_HUMAN HDAC6_HUMAN 1 U
CBL_HUMAN GRB2_HUMAN 1 U
CBL_HUMAN EGFR_HUMAN 1 U
MYH9_HUMAN S10A4_HUMAN 1 U
EMD_HUMAN LMNA_HUMAN 1 U
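
A quick way to check a pathway file for repeated edges, sketched with only the standard library. The function name find_duplicate_edges is hypothetical, and the sketch assumes whitespace-delimited columns with a header row, as in the listing above. It compares Node1 and Node2 in the order written, so an undirected edge written in both orientations would not be flagged.

```python
from collections import Counter


def find_duplicate_edges(pathway_text):
    """Return {(node1, node2): count} for edges listed more than once."""
    lines = pathway_text.strip().splitlines()
    counts = Counter()
    for line in lines[1:]:  # skip the "Node1 Node2 Rank Direction" header
        fields = line.split()
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        counts[(fields[0], fields[1])] += 1
    return {edge: n for edge, n in counts.items() if n > 1}


# A few rows taken from the example above
sample = """Node1 Node2 Rank Direction
EGFR_HUMAN EGF_HUMAN 1 U
S10A4_HUMAN EGF_HUMAN 1 U
EGFR_HUMAN EGF_HUMAN 1 U
LMNA_HUMAN EGF_HUMAN 1 U
"""

duplicates = find_duplicate_edges(sample)
print(duplicates)
```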

ntalluri commented Oct 25, 2024

PR #191 contains a test case that reproduces the error and shows how it arises.

ntalluri commented Oct 25, 2024

My fix is to remove duplicate edges from each dataframe as it is created, before concatenating them:

for tup in edge_tuples:
    dataframe = pd.DataFrame(
        {
            str(tup[0]): 1,
        }, index=tup[1]
    )
    # drop duplicate index values here
    edge_dataframes.append(dataframe)
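
One possible way to fill in the drop-duplicates step, as a sketch rather than the code that was actually merged. It uses Index.duplicated to keep the first occurrence of each edge label. The input data and label format are hypothetical.

```python
import pandas as pd

# Hypothetical input: (algorithm name, list of edge labels)
edge_tuples = [
    ("algo_a", ["A|B", "C|D", "A|B"]),  # "A|B" is listed twice
    ("algo_b", ["A|B", "E|F"]),
]

edge_dataframes = []
for name, edges in edge_tuples:
    dataframe = pd.DataFrame({str(name): 1}, index=edges)
    # Keep only the first row for each repeated index label
    dataframe = dataframe[~dataframe.index.duplicated(keep="first")]
    edge_dataframes.append(dataframe)

# With unique indexes the outer concat can reindex without raising
concated_df = pd.concat(edge_dataframes, axis=1, join="outer")
concated_df = concated_df.fillna(0).astype("int64")
print(concated_df)
```

Deduplicating the edge list before building the dataframe would work as well; filtering on the index keeps the change local to the loop.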

ntalluri commented

In the end, this was not a SPRAS problem, except that the ml code wasn't robust to duplicate indices in the dataframes. The underlying problem lies in the mincostflow implementation.
