PR-1: Hydra-Powered YAML Configuration and run as Pip module #325

Status: Open · wants to merge 6 commits into base `dev`
99 changes: 45 additions & 54 deletions apps/protein_folding/helixfold3/README.md
@@ -44,17 +44,26 @@ Navigate to the `helixfold` directory, then run:
```bash
# Install py env
conda create -n helixfold -c conda-forge python=3.9

# activate the conda environment
conda activate helixfold

# adjust these version numbers to match your CUDA setup
conda install -y cudnn=8.4.1 cudatoolkit=11.7 nccl=2.14.3 -c conda-forge -c nvidia
conda install -y -c bioconda aria2 hmmer==3.3.2 kalign2==2.04 hhsuite==3.3.0
conda install -y -c conda-forge openbabel

# install paddlepaddle
python3 -m pip install paddlepaddle-gpu==2.6.1.post120 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# or a lower version: https://paddle-wheel.bj.bcebos.com/2.5.1/linux/linux-gpu-cuda11.7-cudnn8.4.1-mkl-gcc8.2-avx/paddlepaddle_gpu-2.5.1.post117-cp39-cp39-linux_x86_64.whl

# downgrade pip
pip install --upgrade 'pip<24'

# edit the configuration file at `/helixfold/config/helixfold.yaml` to set your database and binary paths correctly

# install HF3 as a python library
pip install . --no-cache-dir
```
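Before editing the YAML configuration, it may help to confirm that the search and alignment binaries are actually reachable on `PATH`. A quick sanity check (not part of the official instructions):

```shell
# report any of the required tools that cannot be found on PATH
for tool in jackhmmer hhblits hhsearch kalign hmmsearch hmmbuild nhmmer obabel; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```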

Note: If you have a different version of python3 and cuda, please refer to [here](https://www.paddlepaddle.org.cn/whl/linux/gpu/develop.html) for the compatible PaddlePaddle `dev` package.
@@ -125,58 +134,40 @@ sh run_infer.sh
```

The script is as follows:
```bash
#!/bin/bash

PYTHON_BIN="PATH/TO/YOUR/PYTHON"
ENV_BIN="PATH/TO/YOUR/ENV"
MAXIT_SRC="PATH/TO/MAXIT/SRC"
DATA_DIR="PATH/TO/DATA"
export OBABEL_BIN="PATH/TO/OBABEL/BIN"
export PATH="$MAXIT_SRC/bin:$PATH"

CUDA_VISIBLE_DEVICES=0 "$PYTHON_BIN" inference.py \
--maxit_binary "$MAXIT_SRC/bin/maxit" \
--jackhmmer_binary_path "$ENV_BIN/jackhmmer" \
--hhblits_binary_path "$ENV_BIN/hhblits" \
--hhsearch_binary_path "$ENV_BIN/hhsearch" \
--kalign_binary_path "$ENV_BIN/kalign" \
--hmmsearch_binary_path "$ENV_BIN/hmmsearch" \
--hmmbuild_binary_path "$ENV_BIN/hmmbuild" \
--nhmmer_binary_path "$ENV_BIN/nhmmer" \
--preset='reduced_dbs' \
--bfd_database_path "$DATA_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt" \
--small_bfd_database_path "$DATA_DIR/small_bfd/bfd-first_non_consensus_sequences.fasta" \
--uniclust30_database_path "$DATA_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08" \
--uniprot_database_path "$DATA_DIR/uniprot/uniprot.fasta" \
--pdb_seqres_database_path "$DATA_DIR/pdb_seqres/pdb_seqres.txt" \
--uniref90_database_path "$DATA_DIR/uniref90/uniref90.fasta" \
--mgnify_database_path "$DATA_DIR/mgnify/mgy_clusters_2018_12.fa" \
--template_mmcif_dir "$DATA_DIR/pdb_mmcif/mmcif_files" \
--obsolete_pdbs_path "$DATA_DIR/pdb_mmcif/obsolete.dat" \
--ccd_preprocessed_path "$DATA_DIR/ccd_preprocessed_etkdg.pkl.gz" \
--rfam_database_path "$DATA_DIR/Rfam-14.9_rep_seq.fasta" \
--max_template_date=2020-05-14 \
--input_json data/demo_protein_ligand.json \
--output_dir ./output \
--model_name allatom_demo \
--init_model ./init_models/checkpoints.pdparams \
--infer_times 3 \
--precision "fp32"
```

##### Run from default config
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold \
input=/repo/PaddleHelix/apps/protein_folding/helixfold3/data/demo_8ecx.json \
output=. CONFIG_DIFFS.preset=allatom_demo
```

##### Run with a customized configuration dir and file (`./myfold.yaml`, for example):
```shell
LD_LIBRARY_PATH=$CONDA_PREFIX/lib/:$LD_LIBRARY_PATH \
helixfold --config-dir=. --config-name=myfold \
input=/repo/PaddleHelix/apps/protein_folding/helixfold3/data/demo_6zcy_smiles.json \
output=. CONFIG_DIFFS.preset=allatom_demo
```

##### Run with additional configuration terms
```shell
LD_LIBRARY_PATH=/mnt/data/envs/conda_env/envs/helixfold/lib/:$LD_LIBRARY_PATH \
helixfold \
input=/repo/PaddleHelix/apps/protein_folding/helixfold3/data/demo_6zcy.json \
output=. \
CONFIG_DIFFS.model.heads.confidence_head.weight=0.01 \
CONFIG_DIFFS.model.global_config.subbatch_size=192
```

The descriptions of the above script are as follows:
* Replace `MAXIT_SRC` with your installed `maxit`'s root path.
* Replace `DATA_DIR` with your downloaded data path.
* Replace `OBABEL_BIN` with your installed `openbabel` path.
* Replace `ENV_BIN` with your conda virtual environment or any environment where `hhblits`, `hmmsearch` and other dependencies have been installed.
* `--preset` - Set `'reduced_dbs'` to use small bfd or `'full_dbs'` to use full bfd.
* `--*_database_path` - Path to datasets you have downloaded.
* `--input_json` - Input data in the form of JSON. Input pattern in `./data/demo_*.json` for your reference.
* `--output_dir` - Model output path. The output will be in a folder named the same as your `--input_json` under this path.
* `--model_name` - Model name in `./helixfold/model/config.py`. Different model names specify different configurations. Minor modifications to the configuration can be specified in `CONFIG_DIFFS` in `config.py` without changing the full configuration in `CONFIG_ALLATOM`.
* `--infer_times` - The number of inferences the model executes for a single input. In each inference, the model infers `5` times (`diff_batch_size`) for the same input by default. This hyperparameter can be changed via `model.head.diffusion_module.test_diff_batch_size` in `./helixfold/model/config.py`.
* `--precision` - Either `bf16` or `fp32`. Please check whether your machine supports `bf16` before changing it. For example, `bf16` is supported by the A100 and H100 (or newer), while the V100 only supports `fp32`.
* `LD_LIBRARY_PATH` - This is required to load the `libcudnn.so` library if you encounter issue like `RuntimeError: (PreconditionNotMet) Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion.`
* `config-dir` - The directory that contains the alternative configuration file you would like to use.
* `config-name` - The name of the configuration file you would like to use.
* `input` - Input data in the form of JSON. Input pattern in `./data/demo_*.json` for your reference.
* `output` - Model output path. The output will be in a folder named after your `input` file under this path.
* `CONFIG_DIFFS.preset` - Model name in `./helixfold/model/config.py`. Different model names specify different configurations. Minor modifications to the configuration can be specified in `CONFIG_DIFFS` in `config.py` without changing the full configuration in `CONFIG_ALLATOM`.
* `CONFIG_DIFFS.*` - Override any model configuration in `CONFIG_ALLATOM`.
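The `CONFIG_DIFFS.*` overrides above follow Hydra's dotted-key convention: each `a.b.c=v` argument is applied onto the nested configuration. A minimal stdlib sketch of that merge, with an illustrative `cfg` dict mirroring the override examples (not HelixFold's actual schema):

```python
def apply_override(cfg: dict, dotted_key: str, value):
    """Walk a dotted key like 'a.b.c' into nested dicts and set the leaf value."""
    *path, leaf = dotted_key.split('.')
    node = cfg
    for part in path:
        node = node.setdefault(part, {})
    node[leaf] = value
    return cfg

# illustrative config shape, mirroring the CONFIG_DIFFS examples above
cfg = {'model': {'global_config': {'subbatch_size': 96},
                 'heads': {'confidence_head': {'weight': 0.0}}}}
apply_override(cfg, 'model.global_config.subbatch_size', 192)
apply_override(cfg, 'model.heads.confidence_head.weight', 0.01)
```

Hydra itself additionally handles type coercion, interpolation, and strict key checking; this sketch only shows the nesting semantics.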

### Understanding Model Output

@@ -21,6 +21,7 @@
import paddle
import itertools
import os
import subprocess

FeatureDict = Mapping[str, np.ndarray]
ModelOutput = Mapping[str, Any] # Is a nested dict.
@@ -164,14 +165,37 @@ def prediction_to_mmcif(pred_atom_pos: Union[np.ndarray, paddle.Tensor],
- maxit_binary: path to maxit_binary, use to convert pdb to cif
- mmcif_path: path to save *.cif
"""
if not os.path.isfile(maxit_binary):
    raise FileNotFoundError(
        f'maxit_binary: {maxit_binary} does not exist. '
        f'link: https://sw-tools.rcsb.org/apps/MAXIT/source.html')

if not mmcif_path.endswith('.cif'):
    raise ValueError(f'mmcif_path should end with .cif; got {mmcif_path}')

pdb_path = mmcif_path.replace('.cif', '.pdb')
pdb_path = prediction_to_pdb(pred_atom_pos, FeatsDict, pdb_path)

cmd = [maxit_binary,
       '-i', pdb_path,
       '-o', '1',
       '-output', mmcif_path,
       ]

print('Launching subprocess "%s"' % ' '.join(cmd))

process = subprocess.Popen(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, env=os.environ.copy())

stdout, stderr = process.communicate()
retcode = process.wait()

if retcode:
    raise RuntimeError('maxit failed\nstdout:\n%s\n\nstderr:\n%s\n' % (
        stdout.decode('utf-8'), stderr[:500_000].decode('utf-8')))

return mmcif_path
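The capture-and-raise pattern used for the `maxit` call above can be expressed more compactly with `subprocess.run` (Python 3.5+). A sketch, using `sys.executable` as a stand-in for the external binary:

```python
import subprocess
import sys

def run_tool(cmd):
    """Run an external tool, capture its output, and raise with context on failure."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode:
        raise RuntimeError(
            f'{cmd[0]} failed (exit {proc.returncode})\n'
            f'stdout:\n{proc.stdout}\nstderr:\n{proc.stderr}')
    return proc.stdout

# stand-in for the maxit invocation: any argv-style command works
out = run_tool([sys.executable, '-c', 'print("ok")'])
```

Passing the command as an argv list (rather than a shell string) avoids quoting issues with paths that contain spaces.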
71 changes: 71 additions & 0 deletions apps/protein_folding/helixfold3/helixfold/config/helixfold.yaml
@@ -0,0 +1,71 @@
defaults:
- _self_

# General configuration

bf16_infer: false # Corresponds to --bf16_infer
seed: null # Corresponds to --seed
logging_level: DEBUG # Corresponds to --logging_level
weight_path: /mnt/db/weights/helixfold/HelixFold3-params-240814/HelixFold3-240814.pdparams # Corresponds to --init_model
precision: fp32 # Corresponds to --precision
amp_level: O1 # Corresponds to --amp_level
infer_times: 1 # Corresponds to --infer_times
diff_batch_size: -1 # Corresponds to --diff_batch_size
use_small_bfd: false # Corresponds to --use_small_bfd
msa_only: false # Only process msa

# File paths

input: null # Corresponds to --input_json, required field
output: null # Corresponds to --output_dir, required field
override: false # Set true to override existing msa output directory


# Binary tool paths, leave them as null to find proper ones under PATH or conda bin path
bin:
  jackhmmer: null # Corresponds to --jackhmmer_binary_path
  hhblits: null # Corresponds to --hhblits_binary_path
  hhsearch: null # Corresponds to --hhsearch_binary_path
  kalign: null # Corresponds to --kalign_binary_path
  hmmsearch: null # Corresponds to --hmmsearch_binary_path
  hmmbuild: null # Corresponds to --hmmbuild_binary_path
  nhmmer: null # Corresponds to --nhmmer_binary_path
  obabel: null # Inject to env as OBABEL_BIN

# Database paths
db:
  uniprot: /mnt/db/uniprot/uniprot.fasta # Corresponds to --uniprot_database_path, required field
  pdb_seqres: /mnt/db/pdb_seqres/pdb_seqres.txt # Corresponds to --pdb_seqres_database_path, required field
  uniref90: /mnt/db/uniref90/uniref90.fasta # Corresponds to --uniref90_database_path, required field
  mgnify: /mnt/db/mgnify/mgy_clusters.fa # Corresponds to --mgnify_database_path, required field
  bfd: /mnt/db/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt # Corresponds to --bfd_database_path
  small_bfd: /mnt/db/reduced_bfd/bfd-first_non_consensus_sequences.fasta # Corresponds to --small_bfd_database_path
  uniclust30: /mnt/db/uniref30_uc30/UniRef30_2022_02/UniRef30_2022_02 # Corresponds to --uniclust30_database_path
  rfam: /mnt/db/helixfold/rna/Rfam-14.9_rep_seq.fasta # Corresponds to --rfam_database_path, required field
  ccd_preprocessed: /mnt/db/ccd/ccd_preprocessed_etkdg.pkl.gz # Corresponds to --ccd_preprocessed_path, required field

# Template and PDB information
template:
  mmcif_dir: /mnt/db/pdb_mmcif/mmcif_files # Corresponds to --template_mmcif_dir, required field
  max_date: '2023-03-15' # Corresponds to --max_template_date, required field
  obsolete_pdbs: /mnt/db/pdb_mmcif/obsolete.dat # Corresponds to --obsolete_pdbs_path, required field

# Preset configuration
preset:
  preset: reduced_dbs # Corresponds to --preset, choices=['reduced_dbs', 'full_dbs']

# Other configurations
other:
  maxit_binary: /mnt/data/software/maxit/maxit-v11.100-prod-src/bin/maxit # Corresponds to --maxit_binary


# CONFIG_DIFFS for advanced configuration
CONFIG_DIFFS:
  preset: null # choices=['null', 'allatom_demo', 'allatom_subbatch_64_recycle_1']
  # model:
  #   global_config:
  #     subbatch_size: 96 # model.global_config.subbatch_size
  #   num_recycle: 3 # model.num_recycle
  #   heads:
  #     confidence_head:
  #       weight: 0.0 # model.heads.confidence_head.weight
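The `bin:` section above leaves tool paths as `null` so they can be discovered under `PATH` or the conda bin path. One way that fallback could look as a stdlib sketch (`resolve_bin` is illustrative, not HelixFold's actual helper):

```python
import os
import shutil
from typing import Optional

def resolve_bin(name: str, configured: Optional[str] = None) -> Optional[str]:
    """Prefer an explicitly configured path; else search PATH, then $CONDA_PREFIX/bin."""
    if configured:
        return configured
    found = shutil.which(name)
    if found:
        return found
    conda = os.environ.get('CONDA_PREFIX')
    if conda:
        candidate = os.path.join(conda, 'bin', name)
        if os.path.isfile(candidate):
            return candidate
    return None
```

For example, `resolve_bin('hmmsearch')` with no configured path would return the first `hmmsearch` on `PATH`, if any.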
@@ -5,17 +5,17 @@
'seqs': ccd_seqs,
'msa_seqs': msa_seqs,
'count': count,
'extra_mol_infos': {}, for seqs that have a modified residue type or SMILES.
"""
import collections
import copy
import gzip
import os
import json
import sys
import subprocess
import tempfile
import itertools
sys.path.append('../')
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
@@ -52,9 +52,7 @@
3: 'Unknown error.'
}




def read_json(path):
@@ -144,6 +142,11 @@ def smiles_toMol_obabel(smiles):
"""
generate mol from smiles using obabel;
"""

OBABEL_BIN = os.getenv('OBABEL_BIN')
if not (OBABEL_BIN and os.path.isfile(OBABEL_BIN)):
raise FileNotFoundError(f'Cannot find obabel binary at {OBABEL_BIN}.')

with tempfile.NamedTemporaryFile(suffix=".mol2") as temp_file:
print(f"[OBABEL] Temporary file created: {temp_file.name}")
obabel_cmd = f"{OBABEL_BIN} -:'{smiles}' -omol2 -O{temp_file.name} --gen3d"