Aggregating results across pySCENIC runs #169
Hi @d93espinoza ,

The response to #157 is generally what I would do, and I expanded on this a little in the documentation here, point 2. The Nextflow pipeline does this for you semi-automatically, but it sounds like you've already done the multiple runs.

The approach I've taken is a bit different from yours, though. There are two items of interest: the regulons themselves (TFs) and the target genes of each regulon. First, I do an outer join, keeping all regulons/targets. Then, I count the number of times each regulon/target occurred. This will give a distribution and some idea of how much variability there is. You can then select a threshold as needed. For example, in a 100x run, you'll find some regulons that occur 100% of the time, and others that occur only once (obviously these are low confidence). Generally I've taken a threshold of ~80% occurrence, but I still find it useful to know that other regulons were found but didn't pass the threshold (in case there is a specific regulon you expect to find, but don't). The same logic applies to the target genes.

Hope that helps; I'm interested to hear other approaches as well.
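The count-then-threshold approach described above can be sketched in plain Python. This is illustrative only (the data layout and function name are not pySCENIC API): each run is represented as a dict mapping regulon name to its set of target genes.

```python
from collections import Counter

def regulon_occurrence(runs, threshold=0.8):
    """Count how often each (regulon, target) pair occurs across runs and
    keep pairs seen in at least `threshold` (fraction) of the runs.

    `runs`: list of dicts, one per pySCENIC run, mapping
    regulon name -> set of target genes.
    """
    counts = Counter(
        (regulon, target)
        for run in runs
        for regulon, targets in run.items()
        for target in targets
    )
    min_runs = threshold * len(runs)
    kept = {}
    for (regulon, target), n in counts.items():
        if n >= min_runs:
            kept.setdefault(regulon, set()).add(target)
    return kept

# Three toy runs: TF1/A is fully stable, TF1/B occurs twice, TF2 only once.
runs = [
    {"TF1": {"A", "B"}, "TF2": {"C"}},
    {"TF1": {"A", "B", "D"}},
    {"TF1": {"A"}},
]
print(regulon_occurrence(runs, threshold=0.8))  # → {'TF1': {'A'}}
```

Inspecting the raw `Counter` before filtering gives the occurrence distribution mentioned above, which is what you would look at before committing to a threshold.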
Thank you for your response! Looking at the distributions as you mentioned is what I will do before choosing a threshold. I would also be curious to hear other approaches, but this certainly covers what I was looking for!
Hi all, I have another question regarding aggregation. Since this thread is pinned I thought to post it right here, but let me know if I should open a different issue. My question is on how to correctly perform the aggregation of results from the pyscenic docker image. Currently, I run the grn and ctx steps with:
At this point, what would be the cleanest way to integrate the different regulons files? In this notebook the following function is used (adapted for hg38):

```python
def derive_regulons(
    motifs,
    db_names=(
        'hg38__refseq-r80__10kb_up_and_down_tss.mc9nr',
        'hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr'
    )
):
    from cytoolz import compose
    import numpy as np
    import operator as op
    from pyscenic.transform import df2regulons

    motifs.columns = motifs.columns.droplevel(0)

    def contains(*elems):
        def f(context):
            return any(elem in context for elem in elems)
        return f

    # For the creation of regulons we only keep the 10-species databases and
    # the activating modules. We also remove the enriched motifs for the
    # modules that were created using the method 'weight>50.0%' (because
    # these modules are not part of the default settings of
    # modules_from_adjacencies anymore).
    motifs = motifs[
        np.fromiter(
            map(compose(op.not_, contains('weight>50.0%')), motifs.Context),
            dtype=bool  # np.bool is removed in recent NumPy versions
        )
        & np.fromiter(map(contains(*db_names), motifs.Context), dtype=bool)
        & np.fromiter(map(contains('activating'), motifs.Context), dtype=bool)
    ]

    # We build regulons only using enriched motifs with a NES of 3.0 or
    # higher; we take only directly annotated TFs or TFs annotated for an
    # orthologous gene into account; and we only keep regulons with at
    # least 10 genes.
    regulons = list(
        filter(
            lambda r: len(r) >= 10,
            df2regulons(motifs[
                (motifs['NES'] >= 3.0)
                & (
                    (motifs['Annotation'] == 'gene is directly annotated')
                    | (
                        motifs['Annotation'].str.startswith(
                            'gene is orthologous to'
                        )
                        & motifs['Annotation'].str.endswith(
                            'which is directly annotated for motif'
                        )
                    )
                )
            ])
        )
    )

    # Rename regulons, i.e. remove the suffix.
    return list(map(lambda r: r.rename(r.transcription_factor), regulons))
```

However, is this the same process that aucell uses within the pyscenic docker image? To elaborate: according to the current version of the documentation, the python code to get regulons from grn results is the following:

```python
modules = list(modules_from_adjacencies(adjacencies, ex_matrix))

# Calculate a list of enriched motifs and the corresponding target genes
# for all modules.
with ProgressBar():
    df = prune2df(dbs, modules, MOTIF_ANNOTATIONS_FNAME)

# Create regulons from this table of enriched motifs.
regulons = df2regulons(df)
```

Is the result from the docker image the same? At the moment I build a regulons table like this:

```python
motifs = load_motifs('ctx_output.csv')
regulons = df2regulons(motifs)

r_df = None
for r in regulons:
    ks = []
    for k, v in r.gene2weight.items():
        ks += [k]
    r_df = pd.concat([r_df, pd.DataFrame({r.name: ks})], axis=1)

r_df.fillna('').to_csv('data.regulons.csv')
```

Subsequent selections of regulons can then be based on how many times a target gene appears across the different iterations. However, in this case the format of the regulons file is different from what aucell expects, so can the last step only be executed within python? E.g.:

```python
auc_mtx = aucell(ex_matrix, regulons, num_workers=4)
```

What would be the best way to integrate the n runs of grn and ctx, select the top target genes for each regulon to present to pyscenic aucell, and generate a dataframe where each column represents a regulon (with target genes as values)?

Thanks,
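On the format question above: the wide CSV (one column per regulon, target genes as cell values) has to be converted back to a name-to-gene-list mapping before it can be fed to aucell, which expects gene-signature objects rather than a DataFrame. A minimal pandas sketch of that conversion (column names here are illustrative; the exact signature class and its import path vary by pySCENIC version):

```python
import pandas as pd

def wide_to_genesets(df):
    """Convert a wide regulons table (one column per regulon, target genes
    as cell values, padded with '' or NaN) back to {regulon: [genes]}."""
    gene_sets = {}
    for col in df.columns:
        gene_sets[col] = [g for g in df[col].dropna() if g != ""]
    return gene_sets

# Example matching the r_df layout built above (shorter columns padded).
df = pd.DataFrame({"TF1(+)": ["A", "B", ""], "TF2(+)": ["C", "", ""]})
gene_sets = wide_to_genesets(df)
print(gene_sets)  # → {'TF1(+)': ['A', 'B'], 'TF2(+)': ['C']}
```

Each list in `gene_sets` could then be wrapped into a pySCENIC gene-signature object before calling aucell.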
Hi @fbrundu , For integrating multiple runs of scenic, there is a semi-automated implementation in our collection of Nextflow pipelines (VSN Pipelines), so it might be easiest to use this approach (see here for details). But given that you've already started with the grn+ctx steps, you could integrate these "manually" using the existing scripts we have. Take a look at aggregate_multi_runs_regulons.py and aggregate_multi_runs_features.py, although these start from the AUCell loom file output. But for your approach above, you could also take the ctx output.
Access all of the TFs (in one run) with
You can set the
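The inline snippets above were lost in this transcript, but the idea of pulling the TFs out of the ctx output can be sketched with plain pandas. In recent pySCENIC versions, `load_motifs` returns a DataFrame indexed by a (TF, MotifID) MultiIndex; here that shape is mocked so the example runs without pySCENIC (values are illustrative):

```python
import pandas as pd

# Mock of the load_motifs('ctx_output.csv') result: a DataFrame whose
# index is a (TF, MotifID) MultiIndex.
motifs = pd.DataFrame(
    {"NES": [4.1, 3.5, 3.2]},
    index=pd.MultiIndex.from_tuples(
        [("TF1", "motif-a"), ("TF1", "motif-b"), ("TF2", "motif-c")],
        names=["TF", "MotifID"],
    ),
)

# All TFs present in this run:
tfs = sorted(motifs.index.get_level_values("TF").unique())
print(tfs)  # → ['TF1', 'TF2']
```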
Hi @cflerin , Thank you for the information. I had looked at the Nextflow pipelines, but since I have no experience with Nextflow I decided not to use them for now. I resorted to an approach similar to the one you outlined (given I had already computed the grn+ctx steps). Briefly, I filtered regulons and target genes and generated a GMT file as input for the AUCell step (assuming that, due to the integration, gene weights were difficult to compute). Code below (3 runs):

```python
from collections import Counter

import pandas as pd
from pyscenic.utils import load_motifs
from pyscenic.transform import df2regulons

# Load motifs
motifs = []
for i in range(1, 4):
    motifs.append(load_motifs(f'{prefix}.regulons.{i}.csv'))

# Collect regulons
regulons = []
for regulons_l in map(df2regulons, motifs):
    single = None
    for r in regulons_l:
        ks = []
        for k, v in r.gene2weight.items():
            ks += [k]
        single = pd.concat([single, pd.DataFrame({r.name: ks})], axis=1)
    regulons.append(single.fillna(''))

# Get regulons identified in at least 2 runs (out of 3)
columns = Counter([
    e for l in map(lambda x: list(x.columns), regulons) for e in l
])
columns = [e for e in columns if columns[e] > 1]

# Get target genes identified in at least 2 runs for each regulon
stable_regulons = None
for c in columns:
    targets = []
    for r in regulons:
        if c in r.columns:
            targets.extend(filter(lambda x: x != '', r[c]))
    targets = Counter(targets)
    targets = [e for e in targets if targets[e] > 1]
    stable_regulons = pd.concat([
        stable_regulons, pd.DataFrame({c: targets})
    ], axis=1)

# List of target genes per stable regulon
stable_regulons.to_csv(f'{prefix}.regulons.final.csv', index=False)

# Input data for AUCell
with open(f'{prefix}.regulons.final.gmt', 'wt') as gmt:
    for c in stable_regulons.columns:
        genes = '\t'.join(stable_regulons.loc[:, c].dropna())
        gmt.write(f'{c}\tRegulon_{c}\t{genes}\n')
```

I am not sure if

Thanks
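The GMT layout written above is one line per gene set: name, then a description field, then the tab-separated genes. A minimal stdlib sketch of reading such a file back (the file contents here are illustrative):

```python
import tempfile

def read_gmt(path):
    """Parse a GMT file: name <tab> description <tab> gene1 <tab> gene2 ..."""
    gene_sets = {}
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip malformed or empty lines
            name, _description, *genes = fields
            gene_sets[name] = [g for g in genes if g]
    return gene_sets

# Round-trip a tiny example in the layout produced above.
with tempfile.NamedTemporaryFile('w', suffix='.gmt', delete=False) as fh:
    fh.write("TF1(+)\tRegulon_TF1(+)\tA\tB\n")
    fh.write("TF2(+)\tRegulon_TF2(+)\tC\n")
    path = fh.name

print(read_gmt(path))  # → {'TF1(+)': ['A', 'B'], 'TF2(+)': ['C']}
```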
Hi @cflerin, For multi-run analysis, do you calculate an AUCell score on the "final" regulons, after filtering for regulons that occur above a threshold and also for targets that are found within that regulon e.g. 80% of the time? Best wishes,
Hi!
I've seen a couple threads (#136 #157) that mention running the pySCENIC pipeline multiple times and then aggregating the regulons across runs.
My current approach to this has been to take the regulons after the pruning step and perform an inner-type merge on them where I only keep TF-Target pairs that are common across all runs. However, I wanted to know if the authors or others had other approaches that they've found useful?
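The inner-merge approach described above can be sketched in plain Python (the data layout and function name are illustrative, not pySCENIC API): each run is a set of (TF, target) pairs, and only pairs present in every run are kept.

```python
def intersect_runs(runs):
    """Keep only TF-target pairs present in every run.

    `runs`: list of sets of (TF, target) tuples, one per pySCENIC run.
    Returns {TF: sorted list of targets}.
    """
    common = set.intersection(*runs)
    regulons = {}
    for tf, target in sorted(common):
        regulons.setdefault(tf, []).append(target)
    return regulons

# Three toy runs: only TF1/A and TF2/C survive in all of them.
runs = [
    {("TF1", "A"), ("TF1", "B"), ("TF2", "C")},
    {("TF1", "A"), ("TF2", "C")},
    {("TF1", "A"), ("TF2", "C"), ("TF2", "D")},
]
print(intersect_runs(runs))  # → {'TF1': ['A'], 'TF2': ['C']}
```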
Thanks!