NAs in feature matrix #2

Open
jeinson opened this issue Jun 26, 2018 · 1 comment
Comments

@jeinson

jeinson commented Jun 26, 2018

When generating a matrix of features for RIVER, how do the developers handle situations where no variant near a particular gene has a CADD annotation for features like TFBS or EncOCCombPVal? glmnet cannot handle NAs, but in my dataset 95% of genes have at least one missing feature annotation, so removing such cases would waste most of the data.

Example:

                           cHmmTx cHmmTssBiv cHmmHet cHmmBivFlnk cHmmTxFlnk TFBS EncOCCombPVal
GTEX-111YS:ENSG00000007923  0.016          0       0           0      0.000   NA            NA
GTEX-117YW:ENSG00000007923  0.000          0       0           0      0.000   NA            NA
GTEX-1192X:ENSG00000007923  0.000          0       0           0      0.000   NA            NA
GTEX-11EM3:ENSG00000007923  0.000          0       0           0      0.008   NA            NA
GTEX-11EQ8:ENSG00000007923  0.000          0       0           0      0.000   NA            NA
GTEX-11EQ9:ENSG00000007923  0.016          0       0           0      0.000   NA            NA

@ipw012
Owner

ipw012 commented Jun 27, 2018

Annotations derived from ENCODE, such as chromatin states and TFBS, contain many NAs. In those cases we filled in the minimum value (0), which corresponds to background. This is also what CADD uses when imputing missing variant features.
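
For readers who want to try the zero-fill approach described above, here is a minimal sketch in plain Python. It is not the RIVER authors' code: the function name `impute_background`, the dictionary layout, and the shortened feature list are assumptions made for illustration, and the values are modeled on the rows in the question.

```python
import math

# Hypothetical subset of feature columns, taken from the example in the question.
FEATURES = ["cHmmTx", "cHmmTxFlnk", "TFBS", "EncOCCombPVal"]


def impute_background(matrix, fill=0.0):
    """Return a copy of `matrix` with missing entries (None or NaN) set to `fill`.

    `matrix` maps a row id ("individual:gene") to a list of feature values.
    The default fill of 0 follows the "background" convention from the reply.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return {
        row_id: [fill if is_missing(v) else v for v in values]
        for row_id, values in matrix.items()
    }


# Illustrative input: None and NaN both stand in for R's NA.
raw = {
    "GTEX-111YS:ENSG00000007923": [0.016, 0.000, None, None],
    "GTEX-11EM3:ENSG00000007923": [0.000, 0.008, float("nan"), None],
}

imputed = impute_background(raw)
for row_id, values in imputed.items():
    print(row_id, dict(zip(FEATURES, values)))
```

The sketch returns a new matrix, leaving the raw one untouched, so missingness can still be inspected afterwards. In R, where glmnet lives, the same fill on a numeric matrix `X` can be written as `X[is.na(X)] <- 0`.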

2 participants