-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
INFO/END should not be deprecated #784
Comments
I am not sure I agree with the 'without public discussion' part; we had two release candidates of 4.5 out for several months and requested comments; notes went out via the GA4GH newsletter, Twitter/X updates from the spec maintainers, etc. Always happy to find ways to improve the RFC phase and would appreciate any suggestions, but it continues to be difficult to get feedback to changes until after the actual release. This just as an aside, I do not want to take away from your core argument regarding the change. |
Quite the opposite. END was deprecated in favour of FORMAT LEN as END is incompatible with per-sample gVCF ( I strived to maintain maximum backwards compatibilty when depreciating END and the only issue with just including END the misinterpretation of END when per-sample gVCF blocks are written. Happy to errata the VCFv4.5 specs to recommend END be included in the header & records iff there are no per-sample gVCF blocks in the VCF. There's some risk of misinterpretation by a pre-4.5 gVCF-aware parser but I would expect them to fail with a sematic parsing error when encountering a In addition to the two rounds of public feedback, there was discussion of this change in the Future of VCF working group (where all the scalability changes were discussed and decided on) but it's difficult to keep everyone impacted informed of all relevant changes. I think it would be a good idea to have ADRs for any spec change to any of the hts-specs documents as, although the actual changes are visible, the reasoning behind making any given change (e.g. why FORMAT LEN was chosen over FORMAT END) can be opaque or lost to time. |
Thank you @ohofmann and to @jkbonfield for pointing out this was discussed in other channels. I commented on the pull request #758 on a different topic. Now I see that deprecation of INFO/END was mentioned in the thread there, but as it was not highlighted as one of the main points in the pull request message, I was not aware of it. I think the appropriate place for discussion would be github discussions, I am afraid many people, me included, don't have the capacity to attend GA4GH calls. |
I understand how big sample sizes result in the fragmentation of REF blocks. I don't understand how do you propose the indexers should operate. Parse all FORMAT fields to figure out the maximum block size, as suggested by #758 (comment)? That would be terribly inefficient. Or give up on indexing REF-only blocks? |
There were attempts to raise awareness of VCF 4.5 release candidates on github, in a GA4GH email circular IIRC, and also via twitter (eg https://twitter.com/camerongenomics/status/1815619781199892971, including INFO/END, reposted numerous times including myself, John and Oliver). Understood that not everyone has time to attend the conference calls or carefully read every PR, but for a new release it may be prudent to catch up on the change summary as it's very hard to undo a change once it goes live. Agreed that the GA4GH community could improve too. In light of that, do you have any ideas on how to better reach interested parties such as yourself? This has been a long standing issue. I faced the same problem with the SAM base modification tags. It was like pulling teeth getting any community review, until after a couple of years I gave up and published what I had. Then all the problems started to flood in, as no one wants to do the proof-reading stage. It's most demoralising as a maintainer! Regardless, I think this is maybe also a language / phrasing issue. My understanding from speaking with Daniel is that while INFO/END is deprecated as a means of computing the END of a variant, in favour of sample-specific methods given INFO/END is quite simply inaccurate in any file with more than one sample present, it isn't deprecated as a means to building an index where it is recommended to be the largest of all sample ENDs. We have a conference call this evening at UK 9pm. It is likely Daniel will be there, so it would be far easier to discuss your concerns via a call. You (and others) would be most welcome. Details are on the file formats mailing list, or contact Reggan. (https://www.ga4gh.org/document/meeting-minutes-file-formats-2023/) |
I don't understand what the ADR acronym is, but I think I get the gist. Our minutes are sadly rather poor. Generally there's no transcript at all of what's said, and mostly the recording isn't available either. This is something we ought to improve on, but it takes time and it's not something I can do while also chairing the meeting without substantially slowing things down. However the minute details probably don't matter. A rationale for a change does, and it can be seen as sifting through all the chatter (which is likely spread over many conference calls) and distilling it to a few key points listing the pros and cons of every change, the initial impetus for change, etc. Mostly they'll be trivial too ("fixed typo", or "clarified wording"). Any suggestions where they should be placed though? In the document itself as an appendix (this could be large). In a separate document in this repository? Somewhere else? (Catch-22: this question is best answered by the people who aren't reading it!) |
ADR is Architectural Decision Record. See for example this one for the sequence collections specification. |
I've been reading up on https://adr.github.io/ after this morning's discussion. |
Thanks. I dislike hiding it away in a completely different location though. After our conference call last night I knew there was something used by refget yet I still couldn't find it - even knowing it existed! It needs to be discoverable. That means a link to it in the Change Log section at the end of the primary TeX document, and ideally also discoverable while simply browsing through github. Perhaps just sibling filenames so they naturally appear together, eg VCF4.5.tex and VCF4.5.ADR.md? |
I don't have a good answer for that, but I don't think Twitter does it. It's been my long held belief that there should be only one online channel that interested parties should monitor, otherwise information gets lost. Github issues and discussions seem like the most appropriate place to me. Returning back to INFO/END, when you say it is "inaccurate", we first must understand its semantics. INFO/END is not as much about the length of the biological variant, but about the VCF record and its span. The tag should be viewed as a purely technical instrument for the indexer to understand which records overlap with any given chr:beg-end region. This has nothing to do with alleles in individual samples. I did not raise this as a problem because yes, in agreement with #758 it makes perfect sense to keep per-allele SVLEN, per-sample FORMAT/LEN, etc. But to me it makes no sense to deprecate INFO/END and compute it from FORMAT fields on the fly while indexing. That would be terribly inefficient. And if the idea is to stop indexing such records, then again, that would be such a major change that it must be made very clear and obvious, in github discussions and in the specification. |
As I said above, I think some of this is just miscommunication. The current specification states: "\item END: Deprecated. That doesn't imply it's not included, but that it's no longer the primary source of information for VCF4.5 and is present purely for the purposes of backwards compatibility with older software. Granted I'm not sure the wording is ideal, but please suggest alternatives. Whether that is wise or not is another issue, but fear over this breaking existing code is unfounded. |
The very wording "is present for the purposes of backwards compatibility with older software" is what I am having a problem with and what this issue is about. My question stands: if you write a new indexer or a new code that checks record overlaps, are you expected to parse the entire FORMAT section to learn what previously was present in INFO/END, but does not have to be present any longer, according to the new specification? |
I think the language is poor and needs clarification. My understanding is "does not have to be present" isn't really true, or at least isn't advisable. I think it's basically trying to say the information isn't the right solution now to identify the size of the variant, but should still exist for purposes of indexing. What's missing is the rationalisation and explanatory text detailing what authors of new tools ought to be doing. Is it advisable for anyone new coming to this to still use INFO/END, or should we be recommended them to deduce it via other means? I don't know the answer to this one. I would say though that there have always been things you need to do when dealing with multi-sample data. Eg if you do a range query and a sample subset, then INFO/END may not be enough as it's the superset of all samples and not necessarily the one you're querying. I guess this fits in with the generally accepted API/library of the word "deprecated", where it implies this function is no longer the current recommended one and it's been replaced with something else (possibly but not necessarily also indicating it'll be removed at some point, but for us that'd have to be a VCF 5). |
That is correct in that the indexes we have so far are purely site-oriented, not sample-oriented. If the indexing is extended in the future to support individual samples, then parsing FORMAT fields may become possible. But until that happens, I strongly disagree that INFO/END should or could be deprecated. I'd suggest to word it something like this:
|
It appears that it was decided to deprecate INFO/END, without any public discussion (97b26a1)
Although I see the advantage of using SVLEN, as it is per-allele, rather than per site information, I argue that there is a value in keeping a INFO/END as a per site information #769 (comment).
The deprecation seems to be motivated purely by the redundancy brought by SVLEN. However, END is also used in gVCF.
Therefore, while I understand the need for SVLEN, I wonder what is the advantage of deprecating END? All software that handles indexing, VCF record overlaps, all gVCF-aware programs have to be reviewed and possibly adjusted. That's a major effort requiring coordination of many people. Is this all for a few SV-producing programs to avoid writing one extra INFO tag, or am I missing something?
The text was updated successfully, but these errors were encountered: