Discard or keep non-enhancer sequences? #27
-
@meuleman @lucapinello Should we keep just the intergenic/intronic accessible regions (using the latest genome version to exclude regions that intersect promoters and exons)? Future:
Replies: 2 comments 1 reply
-
Thanks for bringing this up, Lucas. Do you have numbers associated with how often, for each component separately, the current sequences appear to be "non-enhancers"?

Due to the nature of how these sequences were curated, for almost all components there should not be a real issue with them being "promoters" at least, since we select them based on component "purity", which in most cases points to a degree of cell type selectivity not typically observed in promoters (which tend to be constitutively accessible).

That said, for similar reasons, you will absolutely find many promoters in the "Tissue-invariant" component, and this is something to keep in mind. We could consider discarding this entire component for this reason, although I would argue against this, as it is still worthwhile to include to better understand differences in performance.

More generally, I would not be in favor of filtering out regions based on gene annotations at this point. As it stands, our knowledge of what "makes" an enhancer is extremely limited (improving it is part of the objectives of this project), and introducing bias this early on is not a good idea AFAIAC.

That said, if we do want to apply filtering, I suggest a more data-driven approach, for instance filtering based on large chromatin state datasets: regions could be "disqualified" based on being in a Promoter/TSS chromatin state in all/most/some cell types and states. The conditioning approach you suggest is, I think, a worthwhile avenue to pursue: it could pull in other datasets, including the chromatin state annotations I just mentioned.

tl;dr: As it stands, I suggest sticking with the current set and doing a post-mortem on the resulting models and their performance, rather than biasing/filtering at this point.
-
I agree that we should not filter for now, and that we could potentially use additional annotations later to select enhancer-like elements (e.g. overlapping CAGE-seq, H3K27ac, chromatin states).
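
For reference, here is a minimal sketch of what the chromatin-state "disqualification" idea mentioned in this thread could look like. It is only an illustration under stated assumptions: the interval layout (BED-like, 0-based half-open), the function names, the toy cell types, and the `min_fraction` threshold are all hypothetical and not taken from the project's code or data. It flags regions rather than dropping them, so the flags can also be used in a post-mortem on model performance.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (chrom, start, end), 0-based half-open as in BED.
Interval = Tuple[str, int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def promoter_state_fraction(
    region: Interval, promoter_calls: Dict[str, List[Interval]]
) -> float:
    """Fraction of cell types in which `region` overlaps a Promoter/TSS-state segment."""
    if not promoter_calls:
        return 0.0
    hits = sum(
        any(overlaps(region, seg) for seg in segments)
        for segments in promoter_calls.values()
    )
    return hits / len(promoter_calls)


def flag_regions(
    regions: List[Interval],
    promoter_calls: Dict[str, List[Interval]],
    min_fraction: float = 0.5,
) -> List[Tuple[Interval, float, bool]]:
    """Return (region, fraction, disqualified) for every region; nothing is dropped.

    `min_fraction` is an assumed knob for the "all/most/some" choice:
    1.0 means "all cell types", 0.5 roughly "most", and a small positive
    value "some". It should be > 0, otherwise every region is flagged.
    """
    flagged = []
    for region in regions:
        fraction = promoter_state_fraction(region, promoter_calls)
        flagged.append((region, fraction, fraction >= min_fraction))
    return flagged


# Toy data (made up for illustration): Promoter/TSS-state segments per cell type.
promoter_calls = {
    "cell_A": [("chr1", 100, 200)],
    "cell_B": [("chr1", 150, 250)],
    "cell_C": [],
}
regions = [("chr1", 120, 180), ("chr1", 300, 400), ("chr1", 210, 240)]

flagged = flag_regions(regions, promoter_calls, min_fraction=0.5)
for region, fraction, disqualified in flagged:
    print(region, round(fraction, 2), disqualified)
```

The linear scan over segments is fine for a toy example; for genome-scale chromatin state calls an interval tree or a dedicated interval tool would be a more sensible choice.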