Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update gene set enrichment functionality #29

Open
3 of 4 tasks
adamklie opened this issue Aug 22, 2024 · 3 comments
Open
3 of 4 tasks

Update gene set enrichment functionality #29

adamklie opened this issue Aug 22, 2024 · 3 comments
Assignees
Labels
enhancement New feature or request evaluations

Comments

@adamklie
Copy link
Collaborator

adamklie commented Aug 22, 2024

  • Take in low threshold of program weights to remove prior to run
  • Add in capability to use top N genes
  • Rename columns to standardized set (see below)
  • Reconfigure dashapp to load multiple results and have a dropdown to select which one to use)
@adamklie adamklie added the enhancement New feature or request label Aug 22, 2024
@adamklie
Copy link
Collaborator Author

adamklie commented Sep 6, 2024

New prerank implementation:

def perform_prerank(
    loadings: pd.DataFrame,
    geneset: Union[List[str], str, Dict[str, str]],
    n_jobs: int = 1,
    low_cutoff: float = -np.inf,
    n_top: Optional[int] = None,
    **kwargs
) -> pd.DataFrame:
    """Run GSEA prerank on each gene program in the loadings matrix.
    
    Parameters
    ----------
    loadings : pd.DataFrame
        DataFrame of feature loadings for each program.
    geneset : str
        Name of the set to run GSEA on.
    n_jobs : int
        Number of parallel jobs to run.
    low_cutoff : float
        Remove features with loadings at or below this value.
    n_top : int
        Take the top n features with the highest loadings.
        
    Returns
    -------
    pd.DataFrame
        DataFrame of GSEA results sorted by program name and FDR q-value. Includes the following columns:
        - program_name: name of the program
        - Term: gene set name
        - ES: enrichment score
        - NES: normalized enrichment score
        - NOM p-val: nominal p-value (from the null distribution of the gene set)
        - FDR q-val: adjusted False Discovery Rate
        - FWER p-val: Family wise error rate p-values
        - Gene %: percent of gene set before running enrichment peak (ES)
        - Lead_genes: leading edge genes (gene hits before running enrichment peak)
        - tag_before: number of genes in gene set
        - tag_after: number of genes matched to the data
    """

    # Run GSEA prerank for each column of loadings (each cell program)
    pre_res = pd.DataFrame()
    for i in tqdm(loadings.columns, desc='Running GSEA', unit='programs'):

        # Filter out low loadings
        temp_loadings = loadings[i][(loadings[i] > low_cutoff)]

        # Take top n features if specified
        if n_top is not None:
            temp_loadings = temp_loadings.sort_values(ascending=False).head(n_top)
            if len(temp_loadings) < n_top:
                logging.warning(f"Program {i} has less than {n_top} features after filtering. Only {len(temp_loadings)} features will be used.")

        # Run GSEA prerank
        temp_res = gp.prerank(
            rnk=temp_loadings, 
            gene_sets=geneset, 
            threads=n_jobs, 
            **kwargs
        ).res2d

        # Post-process results
        temp_res['Gene %'] = temp_res['Gene %'].apply(lambda x: float(x[:-1]))
        temp_res['tag_before'] = temp_res['Tag %'].apply(lambda x: int(x.split('/')[0]))
        temp_res['tag_after'] = temp_res['Tag %'].apply(lambda x: int(x.split('/')[1]))
        temp_res.drop(columns=['Tag %'], inplace=True)
        if 'Name' in temp_res.columns and temp_res['Name'][0] == "prerank":
            temp_res['Name'] = i
        temp_res.rename(columns={'Name': 'program_name'}, inplace=True)
        temp_res = temp_res.sort_values(['program_name', 'FDR q-val'])
        pre_res = pd.concat([pre_res, temp_res], ignore_index=True)
    
    return pre_res

@adamklie adamklie changed the title Be able to select a top n genes for a program and recompute evaluations Update GSEA to be able to take in N top genes or to threshold loadings before Sep 6, 2024
@adamklie adamklie self-assigned this Sep 6, 2024
@adamklie adamklie changed the title Update GSEA to be able to take in N top genes or to threshold loadings before Update gene set enrichment functionality Sep 10, 2024
@adamklie
Copy link
Collaborator Author

New fisher implementation:

def perform_fisher_enrich(
    loadings, 
    geneset, 
    n_top=500,
    **kwargs
):
    """Run Fisher enrichment on each gene program in the loadings matrix.

    Parameters
    ----------
    loadings : pd.DataFrame
        DataFrame of feature loadings for each program.
    geneset : dict
        Dictionary of gene sets.
    n_top : int
        Number of top features to take.
    
    Returns
    -------
    ['Gene_set', 'Term', 'P-value', 'Adjusted P-value', 'Odds Ratio',
       'Combined Score', 'Genes', 'program_name', 'overlap_numerator',
       'overlap_denominator'],
    pd.DataFrame
        DataFrame of Fisher enrichment results sorted by program name and adjusted p-value. Includes the following columns:
        - program_name: name of the program
        - Term: gene set name
        - P-value: Fisher's exact test p-value
        - Adjusted P-value: adjusted p-value
        - Odds Ratio: odds ratio
        - Combined Score: combined score
        - Genes: genes in the gene set
        - overlap_numerator: number of overlapping genes
        - overlap_denominator: number of genes in the gene set
    TODO
    ----
    - Parallelize across programs
    """

    # Find the intersection of genes present in the mudata object and in the library
    background_genes = set(value for sublist in geneset.values() for value in sublist)
    
    enr_res = pd.DataFrame()
    for i in tqdm(loadings.columns, desc='Running Fisher enrichment', unit='programs'):
        gene_list = list(loadings[i].sort_values(ascending=False).head(n_top).index)
        temp_res = gp.enrich(
            gene_list=gene_list,
            gene_sets=geneset, 
            background=background_genes
        ).res2d
        temp_res["program_name"] = i
        enr_res = pd.concat([enr_res, temp_res], ignore_index=True)
    enr_res['overlap_numerator'] = enr_res['Overlap'].apply(lambda x: int(x.split('/')[0]))
    enr_res['overlap_denominator'] = enr_res['Overlap'].apply(lambda x: int(x.split('/')[1]))
    enr_res.drop(columns=['Overlap', 'Gene_set'], inplace=True)
    enr_res = enr_res[["program_name"] + [col for col in enr_res.columns if col != "program_name"]]
    enr_res = enr_res.sort_values(['program_name', 'Adjusted P-value'])
    
    return enr_res

@adamklie
Copy link
Collaborator Author

Proposed column renames:

if gene_set_enrichment_config["method"] == "fisher":
        pre_res = pre_res.rename(columns={"Term": "term", "P-value": "pval", "Adjusted P-value": "adj_pval", "Odds Ratio": "enrichment", "Genes": "genes"})
    elif gene_set_enrichment_config["method"] == "gsea":
        pre_res = pre_res.rename(columns={"Term": "term", "NOM p-val": "pval", "FDR q-val": "adj_pval", "NES": "enrichment", "Lead_genes": "genes"})

@adamklie adamklie added this to the 2024 Gene Program Jamboree milestone Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request evaluations
Projects
None yet
Development

No branches or pull requests

1 participant