Bug in metadata reading and spectral matching #109

Open
hechth opened this issue Oct 19, 2023 · 10 comments
Labels
bug Something isn't working

Comments

@hechth

hechth commented Oct 19, 2023

When reading spectra from the files in the attached archive, multiple things go wrong.

Firstly, some metadata is not read correctly (missing and inserted as NA) and also the individual entries end up in the wrong places, so the InChIKey from spectrum 2 is assigned to spectrum 1 and InChIKey of spectrum 2 is then NA.

new("Spectra", backend = new("MsBackendMsp", spectraData = new("DFrame", 
    rownames = NULL, nrows = 2L, elementType = "ANY", elementMetadata = NULL, 
    metadata = list(), listData = list(name = c("2,2',3,4',5,5'-Hexachloro-4-methoxybiphenyl", 
    "Pendimethalin"), RETENTION_TIME = c("None", "None"), RETENTION_INDEX = c("2554.1", 
    "2044.6"), PRECURSOR_MZ = c("387.85245", "281.13574"), ADDUCT = c("[M]+", 
    "[M]+"), COLLISION_ENERGY = c("70eV", "70eV"), INSTRUMENT_TYPE = c("GC-EI-Orbitrap", 
    "GC-EI-Orbitrap"), NUM.PEAKS = c("164", "86"), SCANNUMBER = c("-1", 
    NA), SPECTRUMTYPE = c("Centroid", NA), formula = c("C13H19N3O4", 
    NA), inchikey = c("CHIFOSRWCNZCFN-UHFFFAOYSA-N", NA), smiles = c("CCC(CC)NC1=C(C=C(C(=C1[N+](=O)[O-])C)C)[N+](=O)[O-]", 
    NA), AUTHORS = c("Price et al., RECETOX, Masaryk University (CZ)", 
    NA), instrument = c("Q Exactive GC Orbitrap GC-MS/MS", NA
    ), IONIZATION = c("EI+", NA), LICENSE = c("CC BY-NC", NA), 
        mz = new("SimpleNumericList", elementType = "numeric", 
            elementMetadata = NULL, metadata = list(), listData = list(

Also, during matching, not all scores are calculated or they are listed as NA.

The code used for matching is the following:

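# Spectra() and spectraData() are called unqualified, so attach the package.
library(Spectra)

# reference_file, simulated_file and ppm are assumed to be defined earlier.
data_reference <- Spectra(reference_file, source = MsBackendMsp::MsBackendMsp())
data_simulated <- Spectra(simulated_file, source = MsBackendMsp::MsBackendMsp())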


# Define match parameters
match_param <- MetaboAnnotation::MatchForwardReverseParam(
  requirePrecursor = FALSE,
  ppm = ppm, 
  FUN = MsCoreUtils::ndotproduct, 
  THRESHFUN = function(x) which(x >= 0.0), 
  THRESHFUN_REVERSE = function(x) which(x >= 0.0)
)

# Perform matching
matched_spectra <- MetaboAnnotation::matchSpectra(data_simulated, data_reference, match_param)

# Convert matched spectra to data frame
matched_spectra_df <- spectraData(matched_spectra, columns = c("name", "target_name", "reverse_score", "score", "presence_ratio", "matched_peaks_count"))
matched_spectra_df <- as.data.frame(matched_spectra_df)

Also, to actually get the 0 scores, the threshold functions have to be extended with | TRUE, because scores of 0 seem to be represented as NA.
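
If the missing scores really are NA, plain base R behaviour would explain why the | TRUE extension helps: which() drops NA, while NA | TRUE evaluates to TRUE. The sketch below uses a made-up score vector and hypothetical helper names; it does not call MetaboAnnotation and is not how the package computes scores.

```r
# Hypothetical score vector: one regular score, one NA, one zero.
scores <- c(0.8, NA, 0)

# Threshold as used above: which() silently drops the NA entry.
thresh_plain <- function(x) which(x >= 0.0)

# Extended threshold: `NA | TRUE` is TRUE, so every index is kept.
thresh_keep_all <- function(x) which(x >= 0.0 | TRUE)

# More explicit alternative that states the intent (keep NA scores too).
thresh_keep_na <- function(x) which(is.na(x) | x >= 0.0)

thresh_plain(scores)
thresh_keep_all(scores)
thresh_keep_na(scores)
```

Note that | TRUE keeps every index regardless of the score, so the is.na() variant is the safer choice if the threshold is later raised above 0.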

problematic.zip

@jorainer jorainer added the bug Something isn't working label Oct 23, 2023
@jorainer
Member

OK, so it seems there are several problems. I will look into it, thanks for reporting.

@hechth
Author

hechth commented Oct 24, 2023

The bug with the metadata reading and the missing scores is very bizarre, and I have no idea what causes it. Are spectra somehow read in batches?

@jorainer
Member

Looks like we have problems with the MSP files you provided. Are these in "standard" format? I have trouble finding a proper definition of the file format. NIST, however, defines that each spectrum has to start with NAME: - in your case the NAME field is not the first line per spectrum, so all elements before that line get assigned to the previous spectrum. I could add a fix for that by splitting on empty lines instead of NAME elements.

The other problem is the peak list: in addition to the 2 elements per row (m/z and intensity), you sometimes also have a third element with an annotation. That is at present not properly handled. I could add support for that, but it would be nice to have some reference/format definition.
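
The difference between the two splitting strategies can be sketched in base R as follows. This is only an illustration under assumptions: the input lines, field values and function names are made up, and this is not the actual MsBackendMsp code.

```r
# Hypothetical MSP-like input where NAME is not the first field of a record.
msp_lines <- c(
  "FORMULA: X1",
  "NAME: Compound A",
  "NUM PEAKS: 1",
  "50.0 100",
  "",
  "FORMULA: Y2",
  "NAME: Compound B",
  "NUM PEAKS: 1",
  "60.0 100"
)

# Start a new record at every NAME line: fields placed before NAME end up
# in the previous record (or in a stray leading group).
split_by_name <- function(x) {
  x <- x[nzchar(trimws(x))]
  grp <- cumsum(grepl("^NAME:", x, ignore.case = TRUE))
  unname(split(x, grp))
}

# Start a new record after every blank line: field order no longer matters.
split_by_blank <- function(x) {
  blank <- !nzchar(trimws(x))
  grp <- cumsum(blank)
  unname(split(x[!blank], grp[!blank]))
}

split_by_name(msp_lines)
split_by_blank(msp_lines)
```

With this input, splitting by NAME puts "FORMULA: Y2" into the record of Compound A, which mirrors the shifted metadata described in the report, while splitting on blank lines keeps each record together.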

It could well be that the problem you see later with the scores is related to the peak values not being handled correctly.

@hechth
Author

hechth commented Oct 25, 2023

There is no proper definition of the MSP file format :D

I would advise against trying to fix it, because you will run into the same issues as we do with matchms, where you have to support a million flavours of MSP. Maybe I can just convert the spectra to NIST format and force NAME to be the first row. How does Spectra deal with it if there is no NAME present?

@jorainer
Member

If there is no NAME, it will consider the full content of an MSP file as one single spectrum... we're essentially splitting by NAME. But we could split by whitespace instead, which would then not require any specific order of elements.

A bigger problem for now is the 3rd column of the peaks data. I will have to think how to support that (it makes sense to also provide peak annotations if available...)
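
One way to handle an optional third element per peak line is sketched below in base R. The peak lines, the function name and the returned column names are assumptions for illustration only, not the MsBackendMsp implementation, and the sketch assumes every line has at least two whitespace-separated tokens.

```r
# Hypothetical peak lines: m/z, intensity, and an optional annotation.
peak_lines <- c(
  "50.0151 1200",
  "77.0386 5400 \"C6H5\"",
  "105.0335\t999\t\"annotated fragment\""
)

parse_peaks <- function(x) {
  # Capture the first two tokens and keep the rest of the line as free text.
  pattern <- paste0(
    "^[[:space:]]*([^[:space:]]+)[[:space:]]+",
    "([^[:space:]]+)[[:space:]]*(.*)$"
  )
  mat <- do.call(rbind, regmatches(x, regexec(pattern, x)))
  # Strip surrounding double quotes from the annotation, if any.
  ann <- gsub("^\"|\"$", "", trimws(mat[, 4]))
  data.frame(
    mz = as.numeric(mat[, 2]),
    intensity = as.numeric(mat[, 3]),
    annotation = ifelse(nzchar(ann), ann, NA_character_),
    stringsAsFactors = FALSE
  )
}

peaks <- parse_peaks(peak_lines)
peaks
```

Lines without an annotation get NA in the third column, so the numeric m/z and intensity values stay intact either way.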

@hechth
Author

hechth commented Oct 25, 2023

> If there is no NAME, it will consider the full content of an MSP file as one single spectrum... we're essentially splitting by NAME. But we could split by whitespace instead, which would then not require any specific order of elements.
>
> A bigger problem for now is the 3rd column of the peaks data. I will have to think how to support that (it makes sense to also provide peak annotations if available...)

I'm not sure if this makes sense. I'd rather see an R implementation of mzSpecLib and abandon MSP files altogether, since nothing is standardized. We can remove the comments with matchms, which already implements that. So overall, I will try to minimize the MSP, remove the comments and switch to NIST format.

@jorainer
Member

Maybe MGF would be a more standardized alternative?

@hechth
Author

hechth commented Oct 25, 2023

Yeah, this is also something to try.

@jorainer
Member

Anyway, we need to at least throw an error or similar if we encounter an unexpected MSP format.
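
Such a check could look roughly like the base R sketch below. The function name, the error message and the rule that a record must start with NAME are assumptions taken from the discussion above, not the behaviour of any released MsBackendMsp version.

```r
# Hypothetical validation of one record (a character vector of lines).
check_msp_record <- function(record) {
  if (length(record) == 0L ||
      !grepl("^NAME:", record[1L], ignore.case = TRUE)) {
    stop("Unexpected MSP format: record does not start with a NAME field",
         call. = FALSE)
  }
  invisible(TRUE)
}

good <- c("NAME: Compound A", "NUM PEAKS: 1", "50.0 100")
bad <- c("FORMULA: X1", "NAME: Compound A", "NUM PEAKS: 1", "50.0 100")

check_msp_record(good)

# Capture the error message instead of aborting the script.
msg <- tryCatch(
  {
    check_msp_record(bad)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
msg
```

Failing early like this avoids silently assigning fields to the wrong spectrum.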

@jorainer
Member

I have a PR in MsBackendMsp that fixes the issues we have with your MSP files. With that new version it would be possible to properly handle and read your files. PR: rformassspectrometry/MsBackendMsp#15
