DRAGON is a software package to enable De novo, and RAtional prediction of Genome organizatiON. It provides an implementation of the model proposed in the manuscript to simulate chromatin structure and dynamics. With DRAGON, one can predict cell type specific chromosome structures using genome-wide profiles of histone modifications and CTCF molecules.
The package is mainly written in Python, and it streamlines all the necessary steps to process epigenomics data, to perform molecular dynamics simulations and to analyze predicted conformational ensemble for the chromatin.
DRAGON requires a single cpu (serial) or multiple cpus (parallel) with standard RAM to run molecular dynamics simulations. For a decent level of preformance, we recommend computing node with 14+ cores, 2.6+ GHz/core. All simulations in our manuscript are performed on an Intel Xeon E5-2690 v4 2.6Ghz node with 14 cores.
The package is supported for Linux operating systems. It had been tested on the system Linux CentOS 6.8.
DRAGON can be installed by running the following command:
git clone https://github.com/ZhangGroup-MITChemistry/DRAGON.git
or by downloading the zip file with the link:
https://github.com/ZhangGroup-MITChemistry/DRAGON/archive/master.zip
DRAGON uses LAMMPS, a molecular dynamics software package, for simulating chromatin structures. LAMMPS, together with our custom modifications, can be compiled and installed with the following command:
./src/LAMMPS.sh
Note that the GCC compiler needs to be installed beforehand and an environment of OpenMPI is needed to compile the parallel version of LAMMPS.
On a "normal" desktop computer, the typical install time including download and compile steps is ~ 5 minutes.
DRAGON models the chromatin as a coarse-grained bead-spring polymer, with each bead corresponding to a five kb genomic segment. This coarse-grained polymer is made cell type and chromosome specific by assigning each bead with a chromatin state. The polymer bead will also be labeled as an orientation dependent CTCF binding site if there is a strong binding signal in the corresponding region. With its underlying biochemistry specified, the structure of the chromatin can be predicted by simulating the sequence-specific potential energy function parameterized in our manuscript using LAMMPS. See the flowchart below for an illustration of the different steps for chromatin structure prediction.
We further provide step-by-step instructions below to simulate the structure of a 25 MB long region of chromosome 1 from GM12878 cells. All the executable scripts are provided in the ./example/
folder.
To simulate chromosome regions from other cell types with different lengths, we recommend using the python script that is more comprehensive. It streamlines all the steps below together and can launch multiple parallel simulations. See README for its detailed usage.
Before starting any structure predictions, we need to learn the chromatin states from genome-wide histone modification profiles and identify the genomic location and orientation of CTCF binding sites.
./example/1-processEpigenomicsData.sh
This script provides detailed instructions on how to process epigenomics data using ChromHMM and custom python scripts.
To start simulating chromatin structures, the following steps are necessary in order to incorporate the processed epigenomics input into data formats recognized by LAMMPS.
First, one needs to select a 25Mb long chromatin region of interest (the default setting for this example is chr1:20-45Mb from GM12878 cells) by running the following script
./example/2-selectChromatinRegion.sh
This script produces a txt file that lists the region of interested in the following format:
chromosome_id start_position(Mb) end_position(Mb)
{"1": [20, 45]}
If a different chromatin region is desired, one can either directly modify the generated chromatin region txt file or modified the parameters in the original script ./example/2-selectChromatinRegion.sh
and then regenerate the file.
Second, one needs to extract chromatin states and CTCF binding sites for the selected chromatin region from results produced in section I).
./example/3-extractEpigenomicsInput.sh
Three txt files are generated to provide the chromatin state of each polymer bead, the CTCF binding potency of each polymer bead, and the location of nearest CTCF binding sites for each polymer bead. See Chromatin States README and CTCF-binding Sites README for details on file formats and data extraction.
Third, one needs to incorporate the epigenomic inputs produced from the step above into file formats recognized by LAMMPS for molecular dynamics simulation.
./example/4-buildLammpsInput.sh
This script produces a topology file that stores the Cartesian coordinates of each polymer beads and the connectivity among polymer beads, an input file that instructs the specifics of the molecular dynamics simulation, and a bash script to execute LAMMPS.
With all the preparation steps above, we are finally ready to predict chromatin structures by running the following script:
./example/5-runMD.sh
Simulated chromatin structures will be stored in a binary file (DUMP_FILE.dcd) in the folder ./runMolecularDynamics/run_folder/Gm12878/chr1/run00/
. See below on how to use VMD to read this binary file and visualize chromatin structures. Note that this script only runs one molecular dynamics simulation using one CPU. We typically perform multiple simulations to improve conformational sampling and conduct each simulation with multiple CPUs to reduce simulation time. See the python program on how to setup multiple parallel simulations.
We use MATLAB to analyze contact maps and VMD to visualize chromatin structures. Installation of these two software packages is highly recommended. See contact map README and structure visualization README for detailed instructions on usage.
To quantitatively compare predicted chromatin structures with genome-wide chromosome conformation capture experiments (Hi-C), one needs to first calculate the contact probability map by running the script:
./example/6-calcCMAP.sh
The core program to calculate the contact map from trajectory files is written in FORTRAN and has been precompiled with ifort compiler. One can also modify the script to compile it with gfortan if ifort is not available.
The calculated contact map is located at ./analyzeChromatinConformation/contactMap/cmap/
. To visualize the contact map, use the provided MATLAB script. See README for more detailed instructions.
To render the predicted chromatin structures in 3D, run the following script:
./example/7-genVMDScript.sh
The script produces a VMD file written in TCL language that can be loaded into the software VMD to visualize chromatin structures. See README for detailed instruction of visualization steps with VMD.