Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: replace pyliftover with agct to improve performance #264

Merged
merged 3 commits into from
Feb 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 2 additions & 2 deletions docs/source/usage.rst
Original file line number Diff line number Diff line change
Expand Up @@ -103,9 +103,9 @@ Individual classes will accept arguments upon initialization to set parameters r
* - ``UTA_DB_URL``
- A `libpq connection string <https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING>`_, i.e. of the form ``postgresql://<user>:<password>@<host>:<port>/<database>/<schema>``, used by the :py:class:`cool_seq_tool.sources.uta_database.UtaDatabase` class. By default, it is set to ``postgresql://uta_admin:uta@localhost:5433/uta/uta_20210129b``.
* - ``LIFTOVER_CHAIN_37_TO_38``
- A path to a `chainfile <https://genome.ucsc.edu/goldenPath/help/chain.html>`_ for lifting from GRCh37 to GRCh38. Used by :py:class:`cool_seq_tool.sources.uta_database.UtaDatabase` as input to `pyliftover <https://pypi.org/project/pyliftover/>`_. If not provided, pyliftover will fetch it automatically from UCSC.
- A path to a `chainfile <https://genome.ucsc.edu/goldenPath/help/chain.html>`_ for lifting from GRCh37 to GRCh38. Used by :py:class:`cool_seq_tool.sources.uta_database.UtaDatabase` as input to `agct <https://pypi.org/project/agct/>`_. If not provided, agct will fetch it automatically from UCSC.
* - ``LIFTOVER_CHAIN_38_TO_37``
- A path to a `chainfile <https://genome.ucsc.edu/goldenPath/help/chain.html>`_ for lifting from GRCh38 to GRCh37. Used by :py:class:`cool_seq_tool.sources.uta_database.UtaDatabase` as input to `pyliftover <https://pypi.org/project/pyliftover/>`_. If not provided, pyliftover will fetch it automatically from UCSC.
- A path to a `chainfile <https://genome.ucsc.edu/goldenPath/help/chain.html>`_ for lifting from GRCh38 to GRCh37. Used by :py:class:`cool_seq_tool.sources.uta_database.UtaDatabase` as input to `agct <https://pypi.org/project/agct/>`_. If not provided, agct will fetch it automatically from UCSC.

Schema support
--------------
Expand Down
2 changes: 1 addition & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ dependencies = [
"asyncpg",
"aiofiles",
"boto3",
"pyliftover",
"agct >= 0.1.0-dev1",
"polars",
"hgvs",
"biocommons.seqrepo",
Expand Down
29 changes: 14 additions & 15 deletions src/cool_seq_tool/sources/uta_database.py
Original file line number Diff line number Diff line change
Expand Up @@ -10,16 +10,16 @@
import asyncpg
import boto3
import polars as pl
from agct import Converter, Genome
from asyncpg.exceptions import InterfaceError, InvalidAuthorizationSpecificationError
from botocore.exceptions import ClientError
from pyliftover import LiftOver

from cool_seq_tool.schemas import AnnotationLayer, Assembly, Strand

# use `bound` to upper-bound UtaDatabase or child classes
UTADatabaseType = TypeVar("UTADatabaseType", bound="UtaDatabase")

# Environment variables for paths to chain files for pyliftover
# Environment variables for paths to chain files for agct
LIFTOVER_CHAIN_37_TO_38 = environ.get("LIFTOVER_CHAIN_37_TO_38")
LIFTOVER_CHAIN_38_TO_37 = environ.get("LIFTOVER_CHAIN_38_TO_37")
Comment on lines 23 to 24
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@korikuzma are these still used? agct should just be pulling them via wags-tails

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooooh, I think we need to provide support to agct to allow env vars to be passed for the chain files. I plan on storing these files in EFS and pointing to that mounted location

Copy link
Member

@jsstevenson jsstevenson Feb 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should be able to set WAGS_TAILS_DIR to a parent directory, if that helps. Alternatively, we can have agct pass along a data_dir argument to the CustomData class from wags-tails.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsstevenson What if we want to bypass wags-tails all together?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsstevenson This is what I would propose:

Current:

    def __init__(self, from_db: Genome, to_db: Genome) -> None:
        ....
        data_handler = CustomData(
            f"chainfile_{from_db.value}_to_{to_db.value}",
            "chain",
            lambda: "",
            self._download_function_builder(from_db, to_db),
            data_dir=get_data_dir() / "ucsc-chainfile",
        )
        file, _ = data_handler.get_latest()

Suggested change:

    def __init__(self, from_db: Genome, to_db: Genome, file: Optional[Path] = None) -> None:
        ...
        if not file:
            data_handler = CustomData(
                f"chainfile_{from_db.value}_to_{to_db.value}",
                "chain",
                lambda: "",
                self._download_function_builder(from_db, to_db),
                data_dir=get_data_dir() / "ucsc-chainfile",
            )
            file, _ = data_handler.get_latest()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think so -- make a PR?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, wanted to make sure I wasn't missing something before making changes.


Expand Down Expand Up @@ -55,13 +55,13 @@ def __init__(
:param db_url: PostgreSQL connection URL
Format: ``driver://user:password@host/database/schema``
:param chain_file_37_to_38: Optional path to chain file for 37 to 38 assembly.
This is used for ``pyliftover``. If this is not provided, will check to see
This is used for ``agct``. If this is not provided, will check to see
if ``LIFTOVER_CHAIN_37_TO_38`` env var is set. If neither is provided, will
allow ``pyliftover`` to download a chain file from UCSC
allow ``agct`` to download a chain file from UCSC
:param chain_file_38_to_37: Optional path to chain file for 38 to 37 assembly.
This is used for ``pyliftover``. If this is not provided, will check to see
This is used for ``agct``. If this is not provided, will check to see
if ``LIFTOVER_CHAIN_38_TO_37`` env var is set. If neither is provided, will
allow ``pyliftover`` to download a chain file from UCSC
allow ``agct`` to download a chain file from UCSC
"""
self.schema = None
self._connection_pool = None
Expand All @@ -71,15 +71,15 @@ def __init__(

chain_file_37_to_38 = chain_file_37_to_38 or LIFTOVER_CHAIN_37_TO_38
if chain_file_37_to_38:
self.liftover_37_to_38 = LiftOver(chain_file_37_to_38)
self.liftover_37_to_38 = Converter(chainfile=chain_file_37_to_38)
else:
self.liftover_37_to_38 = LiftOver("hg19", "hg38")
self.liftover_37_to_38 = Converter(from_db=Genome.HG19, to_db=Genome.HG38)

chain_file_38_to_37 = chain_file_38_to_37 or LIFTOVER_CHAIN_38_TO_37
if chain_file_38_to_37:
self.liftover_38_to_37 = LiftOver(chain_file_38_to_37)
self.liftover_38_to_37 = Converter(chainfile=chain_file_38_to_37)
else:
self.liftover_38_to_37 = LiftOver("hg38", "hg19")
self.liftover_38_to_37 = Converter(from_db=Genome.HG38, to_db=Genome.HG19)

def _get_conn_args(self) -> Dict:
"""Return connection arguments.
Expand Down Expand Up @@ -955,14 +955,13 @@ async def liftover_to_38(self, genomic_tx_data: Dict) -> None:

def get_liftover(
self, chromosome: str, pos: int, liftover_to_assembly: Assembly
) -> Optional[Tuple]:
) -> Optional[Tuple[str, int]]:
"""Get new genome assembly data for a position on a chromosome.

:param chromosome: The chromosome number. Must be prefixed with ``chr``
:param pos: Position on the chromosome
:param liftover_to_assembly: Assembly to liftover to
:return: [Target chromosome, target position, target strand,
conversion_chain_score] for assembly
:return: Target chromosome and target position for assembly
"""
if not chromosome.startswith("chr"):
logger.warning("`chromosome` must be prefixed with chr")
Expand All @@ -976,10 +975,10 @@ def get_liftover(
logger.warning("%s assembly not supported", liftover_to_assembly)
liftover = None

if liftover is None or len(liftover) == 0:
if not liftover:
logger.warning("%s does not exist on %s", pos, chromosome)
return None
return liftover[0]
return liftover[0][:2]

def _set_liftover(
self,
Expand Down
2 changes: 1 addition & 1 deletion src/cool_seq_tool/version.py
Original file line number Diff line number Diff line change
@@ -1,2 +1,2 @@
"""Define package version."""
__version__ = "0.4.0-dev2"
__version__ = "0.4.0-dev3"
2 changes: 1 addition & 1 deletion tests/sources/test_uta_database.py
Original file line number Diff line number Diff line change
Expand Up @@ -313,7 +313,7 @@ async def test_liftover_to_38(test_db, genomic_tx_data):
def test_get_liftover(test_db):
"""Test that get_liftover works correctly."""
resp = test_db.get_liftover("chr7", 140453136, "GRCh38")
assert resp == ("chr7", 140753336, "+", 14633688187)
assert resp == ("chr7", 140753336)

resp = test_db.get_liftover("chr17", 140453136, "GRCh38")
assert resp is None
Expand Down
Loading