There is a number of tools to find peaks from different file formats. Still there was no suitable tool to extract the peaks from a file with uncontinuous signal. Therefore the footprints_extraction.py was developed. The signal in the input file in .bigWig format is used to estimate peaks or in other words footprints. A sliding window algorithm is used to look through the signal from the input file. A region with significant high signal will be saved as a footprint. To estimate the significance of a footprint the signal is compared with the mean of the signals within the window. After the run the footprints are saved in a .bed file.
Please mention that CONDA is needed for the installation of the Footprints extraction working environment.
- Clone the directory
git clone https://github.com/petrokvitka/Footprints_extraction
- Switch to the directory
cd Footprints_extraction
- Create the needed environment from the file fp_extract.yaml
conda env create --file fp_extract.yaml
- Activate the environment
source activate fp_extract
There are two required input files for the peak calling. They are:
- --bigwig a bigWig file with the signal;
- --bed a corresponding .bed file with peaks of interest.
The user can set the name for the output .bed file using the parameter --output_file. By default the output file is called footprints_extraction_output.bed and is saved to the working directory. While using the whole pipeline, the default name will be used for the output file to save it as an intermediate result.
For each footprint an individual score is found. The score shows the quality of the footprint and is found as the mean of the signals from bigWig file for the corresponding positions. The output .bed file has 8 columns and a header marked with #. The header identifies the names for all columns:
#chr start end name score len max_pos bonus_info
- chr for the chromosom;
- start for the start position of the footprint;
- end for the end position of the footprint;
- name for the unique name of the footprint;
- score for the quality of the footprint;
- strand for the strand of the footprint;
- len for the length of the footprint;
- max_pos for the relative position within the footprint, where the highest signal from the bigWig file is (if there are several positions with the highest score the position in the middle of those will be set as the max_pos);
- bonus_info for the additional information from the input .bed file.
If some information could not be defined, the point . will be written in the corresponding row and column. For example, if there was no bonus_info provided from the input .bed file, the point . could be found at the output file of the footprints_extraction.
During the run, a log-file is written. The log-file is called footprints_extraction.log and is saved to the working directory as well. If the user does not want to receive the messages about the run in the terminal, the parameter --silent can be used to force the script write the messages only to the log-file.
There are also optional parameters that could be set to refine the search:
- --window_length. This parameter sets the length of a sliding window. By default the length 200 bp was taken as the smallest peak of the test data was 200 bp long.
- --step. This parameter sets the number of positions to slide the window forward. By default the step of a length 100 bp was taken.
- --percentage. By default each signal from the bigWig file is compared to the threshold which is the mean of signals within a window (or within a peak, if it is smaller than the chosen window_length). Though the user has a possibility to move this threshold using the parameter --percentage. For example --percentage 10 will add 10% of the found threshold within a window and set it as a new threshold to compare the signal to. The default percentage is set to 0.
- --min_gap. This parameter sets the minimal allowed distancee between two neighbour footprints. All footprints that are nearer to each other than this distance will be merged. By default the min_gap is set to 6 bp.
Changing the optional parameters can lead to varying the number of found footprints and their length. The smaller the --percentage is set, the longer the found footrpints will be, but the quality of each footprint will be lower. The length of a window has also an impact on the length of found footprints. The bigger the --window_len parameter ist set, the longer the found footprints will be, though this change is not as remarkable as while changing the --percentage parameter.
There is a possibility for a user to set the max allowed number of base pairs in between two footprints with the help of a parameter --min_gap. By default 6 bp are allowed. That means, all footprints, that have a smaller number of footprints in between, will be merged. The score for a merged footprint is calculated as a mean of scores of both original footprints. If the user doesn't want to merge the footprints, the --min_gap should be set to 1.
The files in example are based on the Buenrostro's research. The file in .bigWig format represents the uncontinuous signal from the ATAC-seq and the file in .BED format contains the corresponding peaks. The run with example data can be started using following command:
python footprints_extraction.py --bigwig example/buenrostro50k_chr1_fp.bw --bed example/buenrostro50k_chr1_peaks.bed --output_file example_output/output.bed
The output file containing the footprints, as well as the logfile can be found in the example_output folder.
There is still some work to do to improve the readability and quality of the code, and clean the code in overall a little bit.