Difference in e-value between HMMER and pyHMMER #75

jpjarnoux · 2024-09-24T09:41:25Z

Hi!
I'm having some trouble understanding where this can come from.
I tried to align a sequence with HMMER and filter on the e-value at 1e-05.
With HMMER, I have this result in the domtblout:

WP_010791936.1       -            264 HamA1_00017          PDLC00166    282   4.3e-06   24.9   0.0   1   1   8.3e-10   5.4e-06   24.6   0.0    41   143    22   126    12   181 0.78

With pyHMMER:

WP_010791936.1 -            264 HamA1_00017          PDLC00166    282   2.5e-05   24.9   0.0   1   1   9.9e-09   3.2e-05   24.6   0.0    41   143    22   126    12   181 0.78

All values are the same except e-value, i-evalue, c-evalue.

Maybe I missed the information, but is it expected to be different?

I'm using pyhmmer=0.10.7=py310h4b81fae_0 and hmmer=3.4

Thanks for your help

The text was updated successfully, but these errors were encountered:

jpjarnoux · 2024-09-25T07:03:01Z

Hi!

I share part of my script to ensure that it's not me who is wrong here.

def digit_family_sequences(pangenome: Pangenome,
                           disable_bar: bool = False) -> tuple[List[DigitalSequence], bool]:
    """
    Digitalised pangenome gene family sequences for HMM alignment

    Args:
        pangenome: Pangenome object with gene families
        disable_bar: Flag to disable the progress bar

    Returns:
        List[pyhmmer.easel.Sequence]: list of digitalised gene family sequences
    """
    sequences = []
    logging.getLogger("PANORAMA").info("Begin to digitalized gene families sequences...")
    for family in tqdm(pangenome.gene_families, total=pangenome.number_of_gene_families, unit="gene families",
                       desc="Digitalized gene families sequences", disable=disable_bar):
        bit_name = family.name.encode('UTF-8')
        sequence = family.sequence if family.HMM is None else family.HMM.consensus.upper()
        if sequence[-1] == '*':
            sequence = sequence[:-1]
        seq = TextSequence(name=bit_name, sequence=sequence)
        sequences.append(seq.digitize(Alphabet.amino()))
    available_memory = psutil.virtual_memory().available
    return sequences, True if sys.getsizeof(sequences) < available_memory else False
    
def annot_with_hmmsearch(hmms: Dict[str, List[HMM]], gf_sequences: SequenceBlock, meta: pd.DataFrame = None,
                         threads: int = 1, disable_bar: bool = False
                         ) -> Tuple[List[Tuple[str, str, str, float, float, float, float, str, str]], List[TopHits]]:
    """
    Compute HMMer alignment between gene families sequences and HMM

    Args:
        hmms: List of HMM classified by bit_cutoffs
        gf_sequences: List of digitalized gene families sequences
        meta: Metadata associate with HMM
        threads: Number of available threads
        disable_bar:  Disable progress bar

    Returns:
         Alignment results
    """

    def hmmsearch_callback(hmm, _):
        """HMMSearch callback function for debugging
        """
        logging.getLogger('PANORAMA').debug(f"Finished annotation with HMM {hmm.name.decode()}")
        bar.update()

    res = []
    result = namedtuple("Result", res_col_names)
    logging.getLogger("PANORAMA").info("Begin alignment to HMM with HMMSearch")
    seq_file = SequenceFile("P.aeruginosa/all_protein_families.faa", digital=True)
    with tqdm(total=sum(len(hmm_list) for hmm_list in hmms.values()), unit="hmm",
              desc="Align target to HMM", disable=disable_bar) as bar:
        all_top_hits = []
        for cutoff, hmm_list in hmms.items():
            options = {}
            if cutoff != "other":
                options["bit_cutoffs"] = cutoff
            for top_hits in hmmsearch(hmm_list, seq_file, cpus=threads, callback=hmmsearch_callback, **options):
                all_top_hits.append(top_hits)
                for hit in top_hits.included:
                    assign = assign_hit(hit, meta)
                    if assign is not None:
                        res.append(result(*assign))

    return res, all_top_hits

sequences, fit_memory = digit_family_sequences(pangenome, disable_bar=disable_bar)
        if fit_memory:
            logging.getLogger("PANORAMA").debug("Launch pyHMMer-HMMSearch")
            sequence_block = DigitalSequenceBlock(alphabet=Alphabet.amino(), iterable=sequences)
            res, all_top_hits = annot_with_hmmsearch(hmms, sequence_block, meta, threads, disable_bar)
        else:
            logging.getLogger("PANORAMA").debug("Launch pyHMMer-HMMScan")
            res, all_top_hits = annot_with_hmmscan(hmms, sequences, meta, threads, tmp, disable_bar)

althonos · 2024-10-08T13:50:12Z

Hi @jpjarnoux, if the E-values are different but the P-values are the same, then it means the only difference between the two runs is the Z and domZ parameters, which are normally counted by how many sequences are contained in the target database, so the number of HMMs (with hmmsearch) or the number of sequences (with hmmscan). This is documented a bit more here: https://pyhmmer.readthedocs.io/en/stable/examples/performance_tips.html#Search-and-scan

This means in the above example block, you wont get the same E-values in the annot_with_hmmscan and annot_with_hmmsearch unless you set the Z and domZ parameters to the same value. It should be always the same value across calls to your tool, so this probably should be set to your HMM database size.

By the way, sys.getsizeof(sequences) will not give you the total memory footprint of the sequences, only of the list that stores the references to the TextSequence objects. You could do sum(map(sys.getsizeof, sequences)) to sum the size of each sequence instead.

jpjarnoux · 2024-10-15T13:19:51Z

Hi
Thanks for your reply and sorry for my delay.
I set the Z parameters as you suggest and problem solve thanks.

Thanks for the tips too.

althonos added the question Further information is requested label Oct 8, 2024

althonos mentioned this issue Oct 8, 2024

Get AllHits instead of TopHits #77

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difference in e-value between HMMER and pyHMMER #75

Difference in e-value between HMMER and pyHMMER #75

jpjarnoux commented Sep 24, 2024

jpjarnoux commented Sep 25, 2024 •

edited

Loading

althonos commented Oct 8, 2024

jpjarnoux commented Oct 15, 2024

Difference in e-value between HMMER and pyHMMER #75

Difference in e-value between HMMER and pyHMMER #75

Comments

jpjarnoux commented Sep 24, 2024

jpjarnoux commented Sep 25, 2024 • edited Loading

althonos commented Oct 8, 2024

jpjarnoux commented Oct 15, 2024

jpjarnoux commented Sep 25, 2024 •

edited

Loading