NAs in feature matrix #2

jeinson · 2018-06-26T20:12:13Z

When generating a matrix of features for RIVER, how do the developers handle situations where no variant near a particular gene has a CADD annotation for features like TFBS or EncOCCombPVal? glmnet cannot handle NAs, but in my dataset 95% of genes have at least one missing feature annotation, so removing such cases would waste most of the data.

Ex:

	cHmmTx	cHmmTxFlnk	TFBS	EncOCCombPVal
GTEX-111YS:ENSG00000007923	0.016	0.000	NA	NA
GTEX-117YW:ENSG00000007923	0.000	0.000	NA	NA
GTEX-1192X:ENSG00000007923	0.000	0.000	NA	NA
GTEX-11EM3:ENSG00000007923	0.000	0.008	NA	NA
GTEX-11EQ8:ENSG00000007923	0.000	0.000	NA	NA
GTEX-11EQ9:ENSG00000007923	0.016	0.000	NA	NA

ipw012 · 2018-06-27T21:06:04Z

Annotations derived from ENCODE, such as chromatin states and TFBS, have many NAs. In those cases we filled in the minimum value (0), which corresponds to background. This is also what CADD did when imputing its variant features.
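For anyone who wants to try this, here is a minimal sketch of the zero-fill step, assuming a tab-separated matrix laid out like the example above. It is not the RIVER code: the function name `impute_background`, the `BACKGROUND` constant, and the two-row toy input are made up for illustration, and 0 is only known to be the right fill value for the ENCODE-style features mentioned in this thread.

```python
import csv
import io

# Toy excerpt in the same layout as the example in this thread:
# tab-separated, first column is the sample:gene row label.
RAW = (
    "\tcHmmTx\tcHmmTxFlnk\tTFBS\tEncOCCombPVal\n"
    "GTEX-111YS:ENSG00000007923\t0.016\t0.000\tNA\tNA\n"
    "GTEX-11EM3:ENSG00000007923\t0.000\t0.008\tNA\tNA\n"
)

# Fill value described in the reply above (background). Treat it as an
# assumption for any feature whose minimum is not 0.
BACKGROUND = 0.0


def impute_background(text, fill=BACKGROUND):
    """Parse a TSV feature matrix and replace every NA with `fill`.

    Returns (feature_names, row_ids, rows), where rows is a list of
    lists of floats with no missing values left.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    features = header[1:]  # header[0] is the empty row-label column
    row_ids, rows = [], []
    for record in reader:
        if not record:  # skip blank lines
            continue
        row_ids.append(record[0])
        rows.append([fill if v in ("NA", "") else float(v) for v in record[1:]])
    return features, row_ids, rows


features, row_ids, rows = impute_background(RAW)
for row_id, row in zip(row_ids, rows):
    print(row_id, row)
```

If the matrix is already loaded in R, the same fill can be done in place with `x[is.na(x)] <- 0` before passing it to glmnet.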
