We recently had a samtools/htslib issue where U characters in SAM input were being converted to N, because the in-memory encoding (essentially BAM) treats any unrecognised character as unknown. The user's input data came from ONT, which we know can produce U in FASTQ. In BAM the Dorado software converts those to T (so they've already chosen that as their solution), but I don't know where the SAM came from. Regardless, it exists and is genuine user data.
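To illustrate the behaviour described above, here is a minimal sketch of BAM-style 4-bit packing. The 16-symbol alphabet is the one given in the BAM specification; the function names and the case-folding are my own assumptions for illustration and are not htslib's API.

```python
# BAM packs each base into 4 bits using this 16-symbol alphabet.
BAM_ALPHABET = "=ACMGRSVTWYHKDBN"
CODE = {ch: i for i, ch in enumerate(BAM_ALPHABET)}
N_CODE = CODE["N"]  # 15, used for anything not in the alphabet


def encode_nibbles(seq: str) -> bytes:
    """Pack a sequence two bases per byte (hypothetical helper).

    Any character outside the alphabet, including U, falls back to N.
    Upper-casing the input is an assumption of this sketch.
    """
    codes = [CODE.get(ch.upper(), N_CODE) for ch in seq]
    if len(codes) % 2:
        codes.append(0)  # pad the low nibble of the final byte
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def decode_nibbles(data: bytes, length: int) -> str:
    """Unpack `length` bases from nibble-encoded data (hypothetical helper)."""
    out = []
    for i in range(length):
        byte = data[i // 2]
        code = byte >> 4 if i % 2 == 0 else byte & 0x0F
        out.append(BAM_ALPHABET[code])
    return "".join(out)


# The U is lost on the round trip because it has no code of its own.
print(decode_nibbles(encode_nibbles("ACGU"), 4))  # ACGN
```

Once the sequence is in this form there is no way to recover that the original base was U, which is why the conversion has to be decided at parse time.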
IUPAC does acknowledge U as a base type (the original text referred to V as not-T or not-U), but it's obvious they don't disambiguate between them as we don't have different codes for A/T and A/U. I think it was probably an error made very early on in SAM/BAM and samtools to not add U into the lookup table as it's unambiguous in meaning, and the SAM spec even mentions DNA/RNA in the text so it was clearly intended to work with both.
I've already fixed this in htslib (not yet in a release), as evidently users need at least one efficient solution that can convert Us to Ts in sequences and isn't a slow Perl or Python one-liner. :) However, I'd advocate for it being supported by all major implementations.
See samtools/hts-specs#800 and samtools/hts-specs#801 for more discussion, covering the SAM regexp description and how we should track whether this conversion was done.
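As a rough sketch of the kind of change involved (not the actual htslib patch), a 256-entry byte lookup table can give U the same 4-bit code as T, so the conversion costs nothing extra per base. `build_table`, `map_u_to_t` and `roundtrip` are hypothetical names for illustration only.

```python
BAM_ALPHABET = "=ACMGRSVTWYHKDBN"


def build_table(map_u_to_t: bool) -> bytes:
    """Build a byte -> 4-bit code table; unknown bytes map to N (15)."""
    table = bytearray([15]) * 256
    for code, ch in enumerate(BAM_ALPHABET):
        table[ord(ch)] = code
        table[ord(ch.lower())] = code  # accept lower case too (assumption)
    if map_u_to_t:
        # The proposed behaviour: U is unambiguous, so share T's code.
        table[ord("U")] = table[ord("u")] = table[ord("T")]
    return bytes(table)


def roundtrip(seq: str, table: bytes) -> str:
    """Encode with the table, then decode back to text."""
    codes = seq.encode("ascii").translate(table)
    return "".join(BAM_ALPHABET[c] for c in codes)


print(roundtrip("ACGUacgu", build_table(map_u_to_t=False)))  # ACGNACGN
print(roundtrip("ACGUacgu", build_table(map_u_to_t=True)))   # ACGTACGT
```

The table-driven form is the point here: the U to T mapping is a two-entry change to the lookup, with no extra branch in the per-base loop.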
PS. I've no idea what to do with CRAM! For us in htslib it's a moot point as htslib converts to nibble encoding anyway, so we cannot present data to the CRAM encoder that contains U, but technically CRAM could store it. It's not ideal though and would cause lots of wasted space with reference-based encoding with all T vs U being an edit. My gut feeling is it should also just store U as T and use meta-data somewhere to flag it.
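To make the space concern concrete, here is a deliberately simplified model of reference-based encoding that only counts substitutions against an equal-length reference (no indels, clipping or real CRAM features). The sequences and the function name are made up for illustration.

```python
def substitution_edits(read: str, ref: str) -> int:
    """Count positions where the read differs from the reference.

    A toy stand-in for the number of substitution features a
    reference-based encoder would need to store for this read.
    """
    if len(read) != len(ref):
        raise ValueError("this sketch only handles equal-length alignments")
    return sum(1 for r, g in zip(read, ref) if r != g)


ref = "ACGTTTGACT"              # DNA reference (made-up example)
read_u = "ACGUUUGACU"           # RNA read kept with U
read_t = read_u.replace("U", "T")  # same read stored as T

print(substitution_edits(read_u, ref))  # 4
print(substitution_edits(read_t, ref))  # 0
```

In this model every U becomes an edit against the reference's T, so a perfectly matching RNA read would still pay for roughly a quarter of its bases, whereas storing T plus a single flag would cost almost nothing.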
htsjdk is the same, with U in SAM being preserved, but practically speaking people use BAM more than SAM. It looks like Noodles would convert those SAM Us to BAM Ns. This is what I changed in htslib and what I think the BAM specification should be clarifying. It doesn't help anyone to treat U as N for BAM IMO.
I see the BAM encoder disallows U.