MzIdentML Validation Feature #78

Open
9 tasks
sureshhewabi opened this issue Sep 19, 2024 · 10 comments
Labels
CrossLinkingValidationLib Changes related with Crosslinking validations

Comments

@sureshhewabi
Collaborator

  • 1. Validation of crosslinking mzIdentML (mzID) files.

    • a. Validate files of a given folder (input will be a file path)
    • b. Multiple mzIdentML files
    • c. Individual mzIdentML file
    • d. Schema validations
    • e. Semantic validations (if any)
    • f. References are valid
    • g. Other validations (e.g. number of proteins, peptides, spectra)
  • 2. Generate validation report
    We can generate a simple report consisting of the above information.

This is a simple start; let's use this issue to discuss validation.
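
As an illustration of item 1 (f), here is a minimal sketch of a reference check using only the Python standard library. It assumes that every attribute whose name ends in `_ref` should match the `id` of some element in the same file, and it pools all ids together regardless of element type, which is looser than the real schema keys. The function names are hypothetical and are not taken from the converter.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def find_dangling_refs(xml_text: str) -> list:
    """Return (element, attribute, value) for each *_ref with no matching id."""
    root = ET.fromstring(xml_text)
    # Simplification: all id values form one pool, whatever element carries them.
    ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    dangling = []
    for el in root.iter():
        for attr, value in el.attrib.items():
            if attr.endswith("_ref") and value not in ids:
                dangling.append((local_name(el.tag), attr, value))
    return dangling


# Tiny made-up document: two references point at ids that do not exist.
example = """<MzIdentML xmlns="http://psidev.info/psi/pi/mzIdentML/1.2">
  <DBSequence id="dbseq_1" accession="P1" searchDatabase_ref="db_1"/>
  <Peptide id="pep_1"/>
  <PeptideEvidence id="pe_1" peptide_ref="pep_1" dBSequence_ref="dbseq_2"/>
</MzIdentML>"""

for element, attribute, value in find_dangling_refs(example):
    print(f"{element}: {attribute}='{value}' has no matching id")
```

A list of all dangling references suits a report (item 2); a stricter validator could instead raise on the first one.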

sureshhewabi added the CrossLinkingValidationLib label Sep 19, 2024
@colin-combe

One issue here is that to fully validate the mzIdentML file you also need the peaklists, e.g. #81.

@colin-combe

What are people's views on how to deal with that? Two alternatives would be:

  • Allow partial validation without peaklists. PRO: people don't have to deal with moving the peaklist files around. CON: later it could all turn out to be broken.
  • Require that peaklists are provided in order to validate. PRO: you know it's right. CON: people have to move the peaklist files around.

@colin-combe

What if we concentrate on (a), 'Validate files of a given folder (input will be a file path)', and this folder must also contain the peaklist files?

This is the easiest to do because it's most like how the converter already works.

Also, if it just stops after the first error, then that's easier.

Thoughts on this?
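
To make the proposal concrete, here is a minimal sketch of a folder check that requires the peaklists to be present and stops at the first error. It assumes the mzIdentML files use a `.mzid` extension, that each `SpectraData` element's `location` attribute names the peaklist, and that only the file name part of `location` matters. `ValidationError` and `check_peaklists_present` are hypothetical names, not the converter's API.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


class ValidationError(Exception):
    """Raised at the first problem found; nothing after it is checked."""


def check_peaklists_present(folder: Path) -> int:
    """Check that every SpectraData location in every *.mzid file exists in folder.

    Returns the number of mzIdentML files checked.
    Raises ValidationError on the first failure.
    """
    mzid_files = sorted(folder.glob("*.mzid"))
    if not mzid_files:
        raise ValidationError(f"no .mzid files found in {folder}")
    for mzid in mzid_files:
        # iterparse streams the file, so large files are not loaded whole.
        for _, el in ET.iterparse(str(mzid), events=("end",)):
            if el.tag.rsplit("}", 1)[-1] != "SpectraData":
                continue
            location = el.get("location", "")
            # Assumption: ignore the directory part, keep only the file name.
            name = location.replace("\\", "/").rsplit("/", 1)[-1]
            if not name or not (folder / name).is_file():
                raise ValidationError(
                    f"{mzid.name}: peaklist '{name}' not found in {folder}"
                )
    return len(mzid_files)
```

Raising an exception is what makes "stop after the first error" cheap: the caller only needs one `try`/`except` around the whole run.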

@colin-combe

colin-combe commented Sep 23, 2024

@sureshhewabi - #82 - you can take a look at what I've done there.

That PR gives a command-line validation option.

So, as a first attempt, I think it covers 1 (a), (c), (d), (e) to a very limited extent, and (f) above.
1 (b) we could live without in the short term. 1 (g), as I read it, isn't really validation but summary stats; these could be obtained by querying the SQLite DB.

For 2 above, info is printed to standard output; I think it currently includes the logging info we usually see from the converter.

It's not extensively tested. It passes the file Diogo provided. It fails the schema-invalid Kojak file.
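
For the summary stats in 1 (g), here is a minimal sketch of querying a SQLite database for row counts. It deliberately avoids assuming any table names by reading them from `sqlite_master`; the `protein` and `peptide` tables in the usage part are made up for the demonstration and do not reflect the converter's actual schema.

```python
import sqlite3


def table_row_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every user table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
        if not row[0].startswith("sqlite_")  # skip SQLite's internal tables
    ]
    counts = {}
    for name in names:
        # Names come from sqlite_master; double any quotes before embedding.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Usage with an in-memory database and hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT)")
conn.execute("CREATE TABLE peptide (id TEXT)")
conn.executemany("INSERT INTO peptide VALUES (?)", [("p1",), ("p2",)])
print(table_row_counts(conn))
```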

sureshhewabi pushed a commit that referenced this issue Sep 23, 2024
@sureshhewabi
Collaborator Author

> What if we concentrate on (a), 'Validate files of a given folder (input will be a file path)', and this folder must also contain the peaklist files?
>
> This is the easiest to do because it's most like how the converter already works.
>
> Also, if it just stops after the first error, then that's easier.
>
> Thoughts on this?

Yes I agree with that

@colin-combe

> I agree with that

Good, that's the way it works in that PR.

@colin-combe

It's not currently rejecting files that don't have the sequences in Seq elements.
(That is an additional requirement of ours.)
It means they break later. (Also of no use to PDB-IHM without sequences?)
I'll need to change it so it rejects these.

@colin-combe

colin-combe commented Sep 25, 2024

Also, I think I've found another requirement specific to our system - that all Modifications have masses given.

@colin-combe

> Also, I think I've found another requirement specific to our system - that all Modifications have masses given.

Hmm, I think we shouldn't add that as a requirement; rather, the spectrum viewer is broken in some cases at the moment.
(There are other ways the modification masses could be recovered, like the UNIMOD accessions, I think.)
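
A minimal sketch of that fallback idea, assuming a `Modification` element either carries a `monoisotopicMassDelta` attribute or has a `cvParam` child with a UNIMOD accession. The two-entry lookup table is for illustration only (a real one would be loaded from the full UNIMOD table), and `modification_mass` is a hypothetical helper, not something in the converter or the spectrum viewer.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Two well-known UNIMOD monoisotopic mass deltas, for illustration only.
UNIMOD_MASSES = {
    "UNIMOD:4": 57.021464,   # Carbamidomethyl
    "UNIMOD:35": 15.994915,  # Oxidation
}


def modification_mass(mod: ET.Element) -> Optional[float]:
    """Return the mass of a Modification element, or None if it cannot be recovered."""
    delta = mod.get("monoisotopicMassDelta")
    if delta is not None:
        return float(delta)  # mass given explicitly in the file
    for child in mod:
        if child.tag.rsplit("}", 1)[-1] == "cvParam":
            mass = UNIMOD_MASSES.get(child.get("accession", ""))
            if mass is not None:
                return mass  # mass recovered from the UNIMOD accession
    return None


# Usage: no mass attribute, but the accession is in the lookup table.
mod = ET.fromstring(
    '<Modification location="3" residues="C">'
    '<cvParam accession="UNIMOD:4" name="Carbamidomethyl" cvRef="UNIMOD"/>'
    "</Modification>"
)
print(modification_mass(mod))
```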

@colin-combe

So, in PR #84, I think validation works.

It checks that the Seq element is there for target proteins.

I temporarily disabled a check that crosslinks have valid link sites, because I think the main file @aozalevsky is using for testing does have invalid link sites in it.
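
For readers who want to see what a Seq check along these lines could look like, here is a minimal sketch. It is not the code in PR #84. It assumes a protein counts as a target when at least one `PeptideEvidence` that references it is not flagged `isDecoy`, and `targets_missing_seq` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def _local(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def targets_missing_seq(xml_text: str) -> list:
    """Return ids of target DBSequence elements with no non-empty Seq child."""
    root = ET.fromstring(xml_text)
    # Assumption: a DBSequence is a target if any non-decoy PeptideEvidence
    # references it. isDecoy is treated as false when the attribute is absent.
    target_ids = set()
    for el in root.iter():
        if _local(el.tag) == "PeptideEvidence":
            if el.get("isDecoy", "false") not in ("true", "1"):
                target_ids.add(el.get("dBSequence_ref"))
    missing = []
    for el in root.iter():
        if _local(el.tag) == "DBSequence" and el.get("id") in target_ids:
            seq = next((c for c in el if _local(c.tag) == "Seq"), None)
            if seq is None or not (seq.text or "").strip():
                missing.append(el.get("id"))
    return missing
```

Decoy-only entries are skipped on purpose, so a file that omits sequences for decoys would still pass this particular check.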

2 participants