Skip to content

Latest commit

 

History

History
555 lines (434 loc) · 22.5 KB

README.md

File metadata and controls

555 lines (434 loc) · 22.5 KB

CombSAFE

About

This repository contains the implementation of the Combinatorial and Semantic Analysis of Functional Elements (CombSAFE) presented in: Leone M, Galeota E, Masseroli M, Pellizzola M. "Identification, semantic annotation and comparison of combinations of functional elements in multiple biological conditions", 2021. It is a flexible computational method to identify combinations of static and dynamic functional elements genome-wide, or in a specific genomic region, and how they change across semantically annotated biological conditions.

Given as input a set of ChIP-seq dataset samples and the list of functional elements to be considered, the CombSAFE pipeline allows:

  • considering histone modifications, transcription factors and any other type of dynamic or static genomic features (e.g., CpG islands, partially methylated domains, transposable elements, etc.)
  • relying on public repositories to retrieve the considered static functional elements and identify from the input dataset metadata the biological conditions in which the dynamic functional elements of interest were charted
  • leveraging natural language processing techniques and biomedical ontologies to complement the identified conditions with semantic annotations about tissue and disease types
  • identifying combinations of static and dynamic functional elements throughout the genome in the corresponding omics data, using hidden Markov models
  • focusing on specific genomic regions, applying clustering to explore how significant combinations of the functional elements compare among semantically annoted cell types and desease/healthy conditions
  • performing functional enrichment analyses based on the genes found in genomic regions with similar combinations of functional elements.

Structure

CombSAFE/
|-- README.md
|-- LICENSE
|-- .gitignore
|-- notebook/
|   |-- Functional_states_analysis.ipynb
|-- gene_list/
|   |-- MYC_associated.txt
|   |-- test_list.txt
|   |-- tumor_suppressor.txt
|-- CombSAFE/
|   |-- CombSAFE.py
|-- CombSAFE.yml
  • README.md this file
  • LICENSE MIT license file
  • .gitignore standard .gitignore file for Python projects
  • notebook/Functional_states_analysis.ipynb Python notebook to run a functional state analysis
  • gene_list/ folder where gene name lists are stored for the CombSAFE single gene analysis
  • gene_list/test_list.txt list of random genes
  • gene_list/tumor_suppressor.txt list of tumor suppressor genes
  • gene_list/MYC_associated.txt list of MYC interacting genes
  • CombSAFE/CombSAFE.py core Python routines called from within the notebook to perform the functional state analysis
  • CombSAFE.yml dependency yaml file to load in order to perform the CombSAFE analysis

How to install

In order to run the CombSAFE pipeline, please follow the steps below:

  1. Install the Anaconda package and environment manager from here
  2. Load the CombSAFE environment with the command: conda env create -f CombSAFE.yml
  3. Activate the CombSAFE environment with the command: conda activate CombSAFE ON Linux and macOS. On Windows systems digit activate CombSAFE
  4. Run the notebook/Functional_states_analysis.ipynb

NB: The pyGMQL package additionally requires Java. Please follow the installation procedure here.
NB2: The PyEnsembl package additionally requires Ensembl data. Please follow the installation procedure here.
NB3: For Windows users, Visual Studio v.14 or higher is required. Please follow the installation procedure here

Cookbook

In the following, we show how to call the functions implemented to easily perform the different steps of our CombSAFE computational method, providing example resuls for some of them.

Generate input dataset from raw data

combsafe.create_dataset(path, organism, threads=4, from_GEO=False)
Run a ChIP-seq peak calling pipeline from input raw data.

For single-end reads Input files must be structured as follows:

Input_folder/
|-- Raw_Reads/
|   |-- 1.rawreads.fastq
|   |-- 2.rawreads.fastq
|   |-- 3.rawreads.fastq
|   |-- 4.rawreads.fastq
|   |-- 5.rawreads.fastq
|   |-- 6.rawreads.fastq
|   |-- 7.rawreads.fastq
|   |-- ...
|-- Textual_file.txt
  • Raw_Reads a folder containing raw reads in fastq format
  • Textual_file.txt a text file containing the following information:
    • file, filename of the corresponding raw reads file in the Raw_Reads folder
    • factor, transcription factor or histone mark used for the analysis
    • description, all available iformations of the biological source from which to extract terms for semantic annotations.

E.g.:

file factor description
1.rawreads.fastq CTCF low passage primary melanoma cultures
2.rawreads.fastq H3K4me3 Bone marrow mononuclear cells
3.rawreads.fastq MYC human primary monocytes isolated from PBMC of healthy donors
... ... ...

For paired-end reads Input files must be structured as follows:

Input_folder/
|-- Raw_Reads/
|   |-- 1.forward_reads.fastq
|   |-- 1.reverse_reads.fastq
|   |-- 2.forward_reads.fastq
|   |-- 2.reverse_reads.fastq
|   |-- 3.forward_reads.fastq
|   |-- 3.reverse_reads.fastq
|   |-- ...
|-- Textual_file.txt
  • Raw_Reads a folder containing raw reads in fastq format
  • Textual_file.txt a text file containing the following information:
    • file_1, filename of the corresponding forward raw reads file in the Raw_Reads folder
    • file_2, filename of the corresponding reverse raw reads file in the Raw_Reads folder
    • factor, transcription factor or histone mark used for the analysis
    • description, all available informations of the biological source from which to extract terms for semantic annotations.

E.g.:

file_1 file_2 factor description
1.forward_reads.fastq 1.reverse_reads.fastq H3K4me1 Human embryonic stem cells received from the John Doe laboratory
2.forward_reads.fastq 2.reverse_reads.fastq H3K4me3 Nuclei derived from crude preps of adipose tissue
3.forward_reads.fastq 3.reverse_reads.fastq H3K27me3 Monocyte-derived macrophage
... ... ... ...

If you want to start a functional state analysis on GEO experiments, set the from_GEO label as True. In that scenario, input files must be structured as follows:

Input_folder/
|-- Textual_file.txt
  • Textual_file.txt a text file containing the following information:
    • sample_id, Id of the samples on GEO
    • factor, transcription factor or histone mark used for the analysis

E.g.,

sample_id factor
GSM648494 H3K4me1
GSM648495 H3K4me3
GSM575280 H3K27me3
... ...

Parameters:

  • path: path object or file-like object
    • input path folder
  • organism: string
    • reference genome assembly (e.g., "hg19", "hg38", "mm10", "mm39", "rn7", "danrer11", "dm6", "ce11", etc...)
  • threads: int, default 4
    • number of threads for the ChIp-seq pipeline.
  • from_GEO: bool, default False
    • if True, CombSAFE downloads raw reads and metadata from the GEO web page of the input GEO GSM Ids

Example:

>> dataset = create_dataset("./Input_folder/", assembly="hg38", threads=20, from_GEO=False)

Load input dataset

combsafe.load_dataset(path, assembly, from_GEO=False)
Load the input path for the analysis. Skip this step if you have already generated the daatset.
Input files must be structured as follows:

Input_folder/
|-- Chip_Files/
|   |-- 1.narrowPeak.bed
|   |-- 2.broadPeak.bed
|   |-- 3.narrowPeak.bed
|   |-- 4.broadPeak.bed
|   |-- 5.narrowPeak.bed
|   |-- 6.broadPeak.bed
|   |-- 7.narrowPeak.bed
|   |-- ...
|-- Textual_file.txt
  • Chip_Files a folder containing ChIP-Seq files
  • Textual_file.txt a text file containing the following information:
    • sample_id, univolcal id for each sample
    • factor, Transcription Factor or Histone Mark used for the analysis
    • file, filename of the corresponding ChIp-seq file in the Chip_Files folder
    • description, all available informations of the biological source from which to extract terms for semantic annotations.

E.g.:

sample_id factor file description
1 CTCF 1.narrowPeak.bed low passage primary melanoma cultures
2 H3K4me3 2.narrowPeak.bed Bone marrow mononuclear cells
3 MYC 3.narrowPeak.bed human primary monocytes isolated from PBMC of healthy donors
... ... ...

If your dataset is generate from GEO samples and you want to get the description from the GSM GEO webpage, set the from_GEO label as True. In that scenario, Textual_file.txt must be structured as follows:

  • Textual_file.txt a text file containing the following information:
    • sample_id, Id of the samples on GEO
    • factor, Transcription Factor or Histone Mark used for the analysis
    • file, filename of the corresponding ChIp-seq file in the Chip_Files folder
sample_id factor file
GSM648494 H3K4me1 1.narrowPeak.bed
GSM648495 H3K4me3 2.broadPeak.bed
GSM575280 H3K27me3 3.narrowPeak.bed
... ... ...

Parameters:

  • path: path object or file-like object
    • input path folder
  • organism: string
    • reference genome assembly (e.g., "hg19", "hg38", "mm10", "mm39", "rn7", "danrer11", "dm6", "ce11", etc...)
  • from_GEO: bool, default False
    • if True, CombSAFE downloads raw reads and metadata from the GEO web page of the input GEO GSM Ids

Example:

>> input_path = import_path("./Input_folder/", assembly="hg38", from_GEO=True)

Generate semantic annotations

combsafe.generate_semantic_annotations(dataset, ontology_1, ontology_2, disease = False, encode_convert=False)
Generate semantic annotations about tissue and disease types from the input dataset.

Parameters:

  • dataset dataset object
    • imported dataset object
  • ontology_1: str
    • url address of the first selected ontology
  • ontology_2: str
    • url address of the second selected ontology
  • disease bool, default False
    • set True if one of the selected ontologies is involved in disease concepts
  • encode_convert bool, default False
    • if true, encode IDs are searched to be converted to GSM

Returns:

  • semantic_dataframe
    • dataset of biological conditions from the input metadata

Example:

>> ontology_1 = "https://raw.githubusercontent.com/obophenotype/cell-ontology/master/cl-basic.obo"
>> ontology_2 = "https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/doid.obo"
>> semantic_df = generate_semantic_annotations(dataset, ontology_1, ontology_2, disease = True, encode_convert=False)

Data analysis

combsafe.plot_factor_freq(semantic_dataframe, n)
Vertical barplot of the factor frequency in the input dataset.

Parameters:

  • semantic_dataframe: dataframe
    • dataset of biological conditions from the input metadata
  • n: int
    • number of factors to display in the barplot

Example:

>> plot_factor_freq(semantic_df, 30)

alt text


combsafe.generate_fixed_factor_pool(semantic_dataframe, factor_list, number_of_semantic_annotations)
Table containg lists of factors according to the selected parameters.

Parameters:

  • semantic_dataframe: dataframe
    • dataset of biological conditions from the input metadata
  • factor_list: list
    • list of factors to include in the analysis
  • number_of_semantic_annotations: int
    • number of semantic annotations to include in the analysis

Example:

>> generate_fixed_factor_pool(semantic_df, ["CTCF", "MYC"], 5)

alt text


combsafe.get_semantic_annotation_list(semantic_dataframe, factor_list)
List of semantic annotations according to the selected factors.

Parameters:

  • semantic_dataframe: dataframe
    • dataset of biological conditions from the input metadata
  • factor_list: list
    • list of factors to include in the analysis

Example:

>> get_semantic_annotation_list(semantic_df, ["CTCF", "MYC", "POLR2A", "H3K4me3", "H3K27me3"])

alt text


Data extraction and replica combination

combsafe.extract_data(factor_list)
Combine sample replicas of the listed factors and extract their semantic annotations regarding the conditions in which they were mapped.

Parameters:

  • factor_list: list
    • list of factors to include in the analysis

Example:

>> extract_data(["CTCF", "MYC", "POLR2A", "H3K4me3", "H3K27me3"])

[Optional] Add custom tracks

combsafe.add_custom_tracks(tracks_label, path_to_custom_tracks, index)
Add custom tracks of static genomic elements to the analysis (e.g., CpG islands).

Parameters:

  • tracks_label: string
    • name of the custom tracks to add for the analysis
  • path_to_custom_tracks: path
    • UCSC path for downloading custom tracks

Example:

>> add_custom_tracks("CpG_Islands", "./input_files/cpgIslandExt.txt")

Identification of combinations of genomic functional elements

combsafe.identify_functional_states(ChromHMM_path, number_of_states, n_core)
identification of combinations of static and dynamic functional elements throughout the genome.

Parameters:

  • ChromHMM_path: path
    • path to the chromHMM software folder. It can be downloaded here.
  • number_of_states: int
    • number of combinations of genomic functional elements
  • n_core: int
    • number of cores to use for the analysis

Returns:

  • functional_states_dataframe
    • dataset of functional states for each biological conditions
>> functional_states_df = identify_functional_states(chromhmm_path ="./ChromHMM/", number_of_states = 15, n_core = 20)

Alternatively, it is possible to load in house segmentated files from an other HMM segmentation tool and jump to the next step

combsafe.load_custom_segments(input_segment_dir, num_states)
load functional states files from input path.

Parameters:

  • input_segment_dir: path
    • path to the segmentated file folder.
  • number_of_states: int
    • number of combinations of genomic functional elements

Returns:

  • functional_states_dataframe
    • dataset of functional states for each biological conditions

Example:

>> functional_states_df = load_custom_segments(input_segment_dir ="./Input_folder/in_house_segmentated/", num_states=15)

Input files must be structured as follows:

Input_folder/
|-- Segmentated_Files/
|   |-- 1.semantic_annotation_15_segments.bed
|   |-- 2.semantic_annotation_15_segments.bed
|   |-- 3.semantic_annotation_15_segments.bed
|   |-- 4.semantic_annotation_15_segments.bed
|   |-- 5.semantic_annotation_15_segments.bed
|   |-- 6.semantic_annotation_15_segments.bed
|   |-- 7.semantic_annotation_15_segments.bed
|   |-- ...
|-- emissions.txt
  • Segmentated_Files a folder containing raw reads in fastq format
  • emissions.txt a text file structured as follows:
State H3K4me3 POLR2A MYC H3K27me3
1 0.093326 0.457892 0.143540 0.924645
2 0.793153 0.658634 0.972344 0.487613
3 0.940996 0.000234 0.243758 0.187461
4 0.143540 0.763471 0.872346 0.104765
... ... ... ... ...

combsafe.show_emission_graph(custom_palette=colors)
Show emission parameters heatmap of genome functional states combination.

Parameters:

  • custom_palette: list of exadecimals, default=None
    • add a list of customized colors in hexadecimal form to be assigned to the functional states.

Example:

>> colors = ['#c9f9ff', '#e6beff', '#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231','#911eb4', '#bcf60c', '#f032e6', '#fffac8', '#fabebe', '#9a6324', '#46f0f0', '#008080']
>> show_emission_graph(custom_palette=colors)

alt text


Distance Metric Heatmap

combsafe.show_distance_matrix()
Show distance matrix heatmap about functional states generated from the emission parameters file of an HMM model.

Example:

>> show_distance_matrix()

alt text


Single-gene analysis

combsafe.single_gene_analysis(functional_states_dataframe, path_to_gene_list_file, distance_metric )
Given a list of gene symbols in a textual file, the heatmap of the functional states of the related genomic regions is shown.

Parameters:

  • functional_states_dataframe: dataframe
    • dataframe of functional states
  • path_to_gene_list_file: path
    • path to the gene list file
  • distance_metric: function/string
    • distance metric to use for the data. CombSAFE auto-generate from the input emission.txt file a special metric functional_state_distance to weight distances among functional states according to the their function. Alternatively, hamming distance can be selected

Example:

>> single_gene_analysis(functional_states_df, "path_to_gene_list/gene_list.txt", distance_metric = funtional_states_distance)
>> single_gene_analysis(functional_states_df, "path_to_gene_list/gene_list.txt", distance_metric = "hamming")

alt text


PCA Analysis

combsafe.show_distance_matrix()
Shows PCA heatmap among semantic annotation for selected components.

Parameters:

  • functional_states_dataframe: dataframe
    • dataframe of functional states
  • number_of_components: int
    • number of components for the principal component analysis

Return:

  • loadings: dataframe
    • PCA loadings

Example:

>> pca_analysis(functional_states_df, 10)

alt text


Genome-wide analysis

combsafe.genome_reduction(functional_states_dataframe, threshold)
Reduce the initial functional state dataframe to visualize the functional states of the various semantic annotations in the form of a heatmap.
NB: the proportions among the functional states are maintained as in the previous dataframe of functional states.

Parameters:

  • functional_states_dataframe: dataframe
    • dataframe of functional states
  • threshold: int
    • value adopted to avoid the over-representation of states that are specific for a small subset of the conditions or for one condition only. Select 100 for full representation

Return:

  • reducted_dataframe: dataframe
    • reducted dataframe of functional states

Example:

>> reducted_df = genome_reduction(functional_states_df, threshold=90)

combsafe.data_driven_heatmap(reducted_df, distance_metric, min_clust_size, min_sampl)
Show a genome-wide heatmap with the most significant clusters of genomic regions based on their patterns of functional states.

Parameters:

  • reducted_df: dataframe
    • reducted dataframe of functional states
  • distance_metric: function/string
    • distance metric to use for the data. CombSAFE auto-generate from the input emission.txt file a special metric functional_state_distance to weight distances among functional states according to the their function. Alternatively, hamming distance can be selected
  • min_clust_size: int
    • minimum number of clusters accetpted for the analysis
  • min_sampl: int
    • minimum number of samples per cluster accepted for the analysis

Return:

  • clustered_dataframe: dataframe
    • dataframe of functional states ordered according to the cluster parameters

Example:

>> clustered_heatmap = data_driven_heatmap(reducted_df, functional_states_distance, min_clust_size=10, min_sampl=2)

alt text


combsafe.gene_ontology_enrichment_analysis(clustered_dataframe, distance_metric, goea_tool)
Show a genome-wide heatmap with the most significant clusters of genomic regions based on their patterns of functional states.

Parameters:

  • clustered_dataframe: dataframe
    • adataframe of functional states ordered according to the cluster parameters
  • distance_metric: function/string
    • distance metric to use for the data. CombSAFE auto-generate from the input emission.txt file a special metric functional_state_distance to weight distances among functional states according to the their function. Alternatively, hamming distance can be selected
  • goea_tool: str, "great" or "goatool"
    • tool for gene ontology enrichment analysis

Example:

>> gene_ontology_enrichment_analysis(clustered_heatmap, goea_tool = "great", distance_metric=functional_states_distance)