Accept U as a base type and convert to T for BAM. #306

jkbonfield · 2024-10-31T10:24:25Z

I see the BAM encoder disallows U.

We recently had a samtools/htslib issue where the input U characters in SAM were being converted to N due to the in-memory encoding (essentially BAM) treating any unrecognised character as unknown. Their input data came from ONT, which we know can produce U in FASTQ. In BAM the Dorado software converts those to T (so they've already chosen that as their solution), but I don't know where the SAM came from. Regardless, it exists and is genuine user data.

IUPAC does acknowledge U as a base type (the original text referred to V as not-T or not-U), but it's obvious they don't disambiguate between them as we don't have different codes for A/T and A/U. I think it was probably an error made very early on in SAM/BAM and samtools to not add U into the lookup table as it's unambiguous in meaning, and the SAM spec even mentions DNA/RNA in the text so it was clearly intended to work with both.

I've already fixed this in htslib (not yet in a release), as users evidently need at least one efficient way to convert Us to Ts in sequences that isn't a slow Perl or Python one-liner. :) However, I'd advocate for it being supported by all major implementations.

See samtools/hts-specs#800 and samtools/hts-specs#801 for more discussion, covering the SAM regexp description and how we should track whether this conversion was done.

PS. I've no idea what to do with CRAM! For us in htslib it's a moot point as htslib converts to nibble encoding anyway, so we cannot present data to the CRAM encoder that contains U, but technically CRAM could store it. It's not ideal though, and would waste a lot of space with reference-based encoding, since every T vs U difference would be an edit. My gut feeling is it should also just store U as T and use meta-data somewhere to flag it.

zaeleus · 2024-11-05T21:13:52Z

Thanks for the heads up. I'll watch the hts-spec issues for further updates.

> I see the BAM encoder disallows U.

noodles-bam does accept U as per the SAM/BAM specification (§ 4.2.3 "SEQ and QUAL encoding" (2023-11-16)), which makes it clear that U is mapped to N.

> The case-insensitive base codes '=ACMGRSVTWYHKDBN' are mapped to [0, 15] respectively with all other characters mapping to N (value 15).

As you noted, this issue is primarily driven by htslib's use of 4-bit sequences. noodles' implementation supports U in SAM record sequences.

$ samtools-1.21 view xx\#u.sam
a1	99	xx	1	1	16M	=	11	20	=ACMGRSVTWYHKDBN	****************
b1	99	xx	1	1	16M	=	11	20	=ACMGRSVNWYHKDBN	****************

$ cargo run --example sam_view xx\#u.sam
a1	99	xx	1	1	16M	=	11	20	=ACMGRSVTWYHKDBN	****************
b1	99	xx	1	1	16M	=	11	20	=ACMGRSVUWYHKDBN	****************

jkbonfield · 2024-11-06T09:34:17Z

htsjdk is the same, with U in SAM being preserved, but practically speaking people use BAM more than SAM. It looks like Noodles would convert those SAM Us to BAM Ns. This is what I changed in htslib and what I think the BAM specification should be clarifying. It doesn't help anyone to treat U as N for BAM IMO.
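To make the difference between the two mappings concrete, here is a minimal Rust sketch. It is not the noodles or htslib implementation; the function names `encode_base_spec`, `encode_base_u_as_t`, and `pack_sequence` are hypothetical and exist only for illustration. It assumes the 4-bit code table quoted from the specification above and the usual BAM packing of two bases per byte, high nibble first.

```rust
/// BAM 4-bit base codes; the index of each character is its encoded value.
const BASE_CODES: &[u8; 16] = b"=ACMGRSVTWYHKDBN";

/// Mapping as the quoted specification text describes it: a case-insensitive
/// lookup where any unrecognised character (including U) becomes N (15).
fn encode_base_spec(base: u8) -> u8 {
    let upper = base.to_ascii_uppercase();
    BASE_CODES
        .iter()
        .position(|&c| c == upper)
        .map_or(15, |i| i as u8)
}

/// Variant proposed in this issue (hypothetical helper): treat U/u as T
/// before falling back to the specification lookup.
fn encode_base_u_as_t(base: u8) -> u8 {
    match base {
        b'U' | b'u' => 8, // 8 is the code for T in BASE_CODES
        _ => encode_base_spec(base),
    }
}

/// Pack a sequence two bases per byte, high nibble first. An odd-length
/// sequence leaves the final low nibble as 0.
fn pack_sequence(seq: &[u8], encode: fn(u8) -> u8) -> Vec<u8> {
    seq.chunks(2)
        .map(|pair| {
            let hi = encode(pair[0]);
            let lo = pair.get(1).map_or(0, |&b| encode(b));
            (hi << 4) | lo
        })
        .collect()
}
```

With these definitions, `pack_sequence(b"ACGU", encode_base_spec)` yields the bytes `0x12, 0x4f` (the U is stored as N), while `pack_sequence(b"ACGU", encode_base_u_as_t)` yields `0x12, 0x48` (the U is stored as T). Either way the original U cannot be recovered from the BAM record alone, which is why the linked hts-specs issues also discuss how to track that the conversion happened.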

zaeleus added the bam label Oct 31, 2024
