Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accept U as a base type and convert to T for BAM. #306

Open
jkbonfield opened this issue Oct 31, 2024 · 2 comments
Open

Accept U as a base type and convert to T for BAM. #306

jkbonfield opened this issue Oct 31, 2024 · 2 comments
Labels

Comments

@jkbonfield
Copy link

I see the BAM encoder disallows U.

We recently had a samtools/htslib issue where the input U characters in SAM were being converted to N due to the in-memory encoding (essentially BAM) treating any unrecognised character as unknown. Their input data came from ONT, which we know can produce U in FASTQ. In BAM the Dorado software converts those to T (so they've already chosen that as their solution), but I don't know where the SAM came from. Regardless, it exists and is genuine user data.

IUPAC does acknowledge U as a base type (the original text referred to V as not-T or not-U), but it's obvious they don't disambiguate between them as we don't have different codes for A/T and A/U. I think it was probably an error made very early on in SAM/BAM and samtools to not add U into the lookup table as it's unambiguous in meaning, and the SAM spec even mentions DNA/RNA in the text so it was clearly intended to work with both.

I've already fixed this in htslib (not yet in a release) as evidently users need at least one efficient solution that can convert Us to Ts in sequences and isn't a perl, python etc slow one-liner. :). However I'd advocate for it being supported by all major implementations.

See samtools/hts-specs#800 and samtools/hts-specs#801 for more discussion on this, which discusses the SAM regexp description and also how we should track whether this conversion was done.

PS. I've no idea what to do with CRAM! For us in htslib it's a moot point as htslib converts to nibble encoding anyway, so we cannot present data to the CRAM encoder that contains U, but technically CRAM could store it. It's not ideal though and would cause lots of wasted space with reference-based encoding with all T vs U being an edit. My gut feeling is it should also just store U as T and use meta-data somewhere to flag it.

@zaeleus zaeleus added the bam label Oct 31, 2024
@zaeleus
Copy link
Owner

zaeleus commented Nov 5, 2024

Thanks for the heads up. I'll watch the hts-spec issues for further updates.

I see the BAM encoder disallows U.

noodles-bam does accept U as per the SAM/BAM specification (§ 4.2.3 "SEQ and QUAL encoding" (2023-11-16)), which makes it clear that U is mapped to N.

The case-insensitive base codes '=ACMGRSVTWYHKDBN' are mapped to [0, 15] respectively with all other characters mapping to N (value 15).

As you noted, this issue is primarily driven by htslib's use of 4-bit sequences. noodles' implementation supports U in SAM record sequences.

$ samtools-1.21 view xx\#u.sam
a1	99	xx	1	1	16M	=	11	20	=ACMGRSVTWYHKDBN	****************
b1	99	xx	1	1	16M	=	11	20	=ACMGRSVNWYHKDBN	****************

$ cargo run --example sam_view xx\#u.sam
a1	99	xx	1	1	16M	=	11	20	=ACMGRSVTWYHKDBN	****************
b1	99	xx	1	1	16M	=	11	20	=ACMGRSVUWYHKDBN	****************

@jkbonfield
Copy link
Author

htsjdk is the same, with U in SAM being preserved, but practically speaking people use BAM more than SAM. It looks like Noodles would convert those SAM Us to BAM Ns. This is what I changed in htslib and what I think the BAM specification should be clarifying. It doesn't help anyone to treat U as N for BAM IMO.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants