Bug in metadata reading and spectral matching #109

hechth · 2023-10-19T09:47:18Z

When reading spectra from the files in the attached archive, multiple things go wrong.

First, some metadata is not read correctly: values are missing (inserted as NA) and individual entries end up in the wrong places. For example, the InChIKey from spectrum 2 is assigned to spectrum 1, and the InChIKey of spectrum 2 is then NA.

new("Spectra", backend = new("MsBackendMsp", spectraData = new("DFrame", 
    rownames = NULL, nrows = 2L, elementType = "ANY", elementMetadata = NULL, 
    metadata = list(), listData = list(name = c("2,2',3,4',5,5'-Hexachloro-4-methoxybiphenyl", 
    "Pendimethalin"), RETENTION_TIME = c("None", "None"), RETENTION_INDEX = c("2554.1", 
    "2044.6"), PRECURSOR_MZ = c("387.85245", "281.13574"), ADDUCT = c("[M]+", 
    "[M]+"), COLLISION_ENERGY = c("70eV", "70eV"), INSTRUMENT_TYPE = c("GC-EI-Orbitrap", 
    "GC-EI-Orbitrap"), NUM.PEAKS = c("164", "86"), SCANNUMBER = c("-1", 
    NA), SPECTRUMTYPE = c("Centroid", NA), formula = c("C13H19N3O4", 
    NA), inchikey = c("CHIFOSRWCNZCFN-UHFFFAOYSA-N", NA), smiles = c("CCC(CC)NC1=C(C=C(C(=C1[N+](=O)[O-])C)C)[N+](=O)[O-]", 
    NA), AUTHORS = c("Price et al., RECETOX, Masaryk University (CZ)", 
    NA), instrument = c("Q Exactive GC Orbitrap GC-MS/MS", NA
    ), IONIZATION = c("EI+", NA), LICENSE = c("CC BY-NC", NA), 
        mz = new("SimpleNumericList", elementType = "numeric", 
            elementMetadata = NULL, metadata = list(), listData = list(

Also, during matching, not all scores are calculated or they are listed as NA.

The code used for matching is the following:

library(Spectra)

# reference_file, simulated_file and ppm are defined elsewhere (not shown here)
data_reference <- Spectra(reference_file, source = MsBackendMsp::MsBackendMsp())
data_simulated <- Spectra(simulated_file, source = MsBackendMsp::MsBackendMsp())

# Define match parameters
match_param <- MetaboAnnotation::MatchForwardReverseParam(
  requirePrecursor = FALSE,
  ppm = ppm, 
  FUN = MsCoreUtils::ndotproduct, 
  THRESHFUN = function(x) which(x >= 0.0), 
  THRESHFUN_REVERSE = function(x) which(x >= 0.0)
)

# Perform matching
matched_spectra <- MetaboAnnotation::matchSpectra(data_simulated, data_reference, match_param)

# Convert matched spectra to data frame
matched_spectra_df <- spectraData(matched_spectra, columns = c("name", "target_name", "reverse_score", "score", "presence_ratio", "matched_peaks_count"))
matched_spectra_df <- as.data.frame(matched_spectra_df)

Also, to actually get the 0 scores, the threshold functions have to be extended with | TRUE, because 0 scores seem to be represented as NA.
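A minimal base R sketch of why adding | TRUE changes the result, assuming the affected scores really are NA when the threshold function sees them (that is only suspected above, not confirmed). The score values are made up for illustration.

```r
# Toy score vector (hypothetical values, not taken from the attached files).
scores <- c(0.8, NA, 0)

# Threshold function as used in the match parameters above.
thresh_plain <- function(x) which(x >= 0.0)

# Variant extended with | TRUE.
thresh_keep_all <- function(x) which(x >= 0.0 | TRUE)

# NA >= 0 evaluates to NA, and which() silently skips NA entries.
idx_plain <- thresh_plain(scores)

# NA | TRUE evaluates to TRUE, so every index survives, including the NA score.
idx_keep_all <- thresh_keep_all(scores)

print(idx_plain)
print(idx_keep_all)
```

With | TRUE the threshold no longer filters anything, so it works as a workaround for inspecting all pairs rather than as an actual score cutoff.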

problematic.zip


jorainer · 2023-10-24T05:16:04Z

OK, so it seems there are several problems. I will look into it, thanks for reporting.

hechth · 2023-10-24T08:08:53Z

The bug with the metadata reading and missing scores is very bizarre, and I have no idea what causes it either. Are spectra somehow read in batches?

jorainer · 2023-10-25T06:41:44Z

Looks like we have problems with the MSP files you provided. Are these in "standard" format? I have trouble finding a proper definition of the file format. NIST, however, defines that each spectrum has to start with NAME:. In your case the NAME field is not the first line per spectrum, so all elements before that line get assigned to the previous spectrum. I could add a fix for that by splitting on empty lines instead of NAME elements.
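To illustrate the effect described here, a small base R sketch that compares the two splitting strategies on toy input. This is a simplified stand-in, not the actual MsBackendMsp code: the helper names (split_by_name, split_by_blank, get_field) and the example records are hypothetical, and dropping lines that precede the first NAME is an assumption of the sketch.

```r
# Toy MSP-like text (hypothetical); NAME is not the first field of each record.
msp_lines <- c(
  "INCHIKEY: KEY-A",
  "NAME: Spectrum A",
  "NUM PEAKS: 1",
  "100.0 999",
  "",
  "INCHIKEY: KEY-B",
  "NAME: Spectrum B",
  "NUM PEAKS: 1",
  "200.0 999"
)

# Strategy 1: a new record starts at every NAME line.
# Assumption: lines before the first NAME belong to no record and are dropped.
split_by_name <- function(x) {
  x <- x[nzchar(trimws(x))]
  group <- cumsum(grepl("^NAME:", x, ignore.case = TRUE))
  keep <- group > 0
  unname(split(x[keep], group[keep]))
}

# Strategy 2: records are separated by blank lines, field order is irrelevant.
split_by_blank <- function(x) {
  blank <- !nzchar(trimws(x))
  group <- cumsum(blank)
  unname(split(x[!blank], group[!blank]))
}

# Extract the value of one "FIELD: value" line from a record, NA if absent.
get_field <- function(record, field) {
  hit <- grep(paste0("^", field, ":"), record, value = TRUE, ignore.case = TRUE)
  if (length(hit) == 0L) return(NA_character_)
  trimws(sub("^[^:]*:", "", hit[1L]))
}

inchi_by_name <- vapply(split_by_name(msp_lines), get_field,
                        character(1), field = "INCHIKEY")
inchi_by_blank <- vapply(split_by_blank(msp_lines), get_field,
                         character(1), field = "INCHIKEY")

print(inchi_by_name)   # the key of record B lands in record A, record B gets NA
print(inchi_by_blank)  # each record keeps its own key
```

Under NAME-based splitting the toy input reproduces the reported symptom: the second spectrum's InChIKey is attached to the first spectrum and the second one is left with NA.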

The other problem is the peak list: in addition to the 2 elements per row (m/z and intensity), you sometimes also have a third element with an annotation. That is not properly handled at present. I could add support for that, but it would be nice to have some reference/format definition.
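One possible way to tolerate an optional third element, sketched in base R. The peak lines, the quoting of the annotation, and the helper parse_peak_line are all hypothetical; the real files and the eventual MsBackendMsp handling may differ.

```r
# Toy peak lines (hypothetical): two with m/z and intensity only,
# one with an additional annotation as a third element.
peak_lines <- c(
  "100.0 999",
  "150.5 120 \"C6H5\"",
  "200.25\t45"
)

# Parse one peak line into m/z, intensity and an optional annotation.
parse_peak_line <- function(line) {
  parts <- strsplit(trimws(line), "[[:space:]]+")[[1L]]
  if (length(parts) < 2L) {
    stop("Peak line needs at least m/z and intensity: ", line)
  }
  # Everything after the second element is treated as annotation text.
  annotation <- if (length(parts) > 2L) {
    paste(parts[-(1:2)], collapse = " ")
  } else {
    NA_character_
  }
  data.frame(
    mz = as.numeric(parts[1L]),
    intensity = as.numeric(parts[2L]),
    annotation = gsub("\"", "", annotation),
    stringsAsFactors = FALSE
  )
}

peaks <- do.call(rbind, lapply(peak_lines, parse_peak_line))
print(peaks)
```

Keeping the annotation in its own column leaves m/z and intensity numeric, so an unexpected third element cannot shift or corrupt the values used for scoring.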

It could well be that the problem you see later with the scores is related to the peak values not being handled correctly.

hechth · 2023-10-25T09:47:57Z

There is no proper definition of the MSP file format :D

I would advise against trying to fix it, because you will run into the same issues as we do with matchms, where you have to support a million flavours of MSP. Maybe I can just convert the spectra to NIST format and then force NAME to be the first row. How does Spectra deal with it if there is no NAME present?

jorainer · 2023-10-25T10:16:49Z

If there is no NAME, it will consider the full content of an MSP file as one single spectrum... we're essentially splitting by NAME. But we could split by whitespace instead, which would then not require any specific order of elements.

A bigger problem for now is the 3rd column of the peaks data. I will have to think how to support that (it makes sense to also provide peak annotations if available...)

hechth · 2023-10-25T11:02:37Z

> If there is no NAME, it will consider the full content of an MSP file as one single spectrum... we're essentially splitting by NAME. But we could split by whitespace instead, which would then not require any specific order of elements.
>
> A bigger problem for now is the 3rd column of the peaks data. I will have to think how to support that (it makes sense to also provide peak annotations if available...)

I'm not sure if this makes sense. I'd rather see an R implementation of mzSpecLib and abandon MSP files altogether, since nothing is standardized. We can remove the comments with matchms, which already implements that. So overall, I will try to minimize the MSP files, remove the comments, and switch to NIST format.

jorainer · 2023-10-25T11:43:15Z

maybe mgf would be more standardized as an alternative?

hechth · 2023-10-25T12:32:28Z

Yeah this is also something to try

jorainer · 2023-10-25T12:53:56Z

Anyway, we need to at least throw an error or similar if we encounter an unexpected MSP format.
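A rough base R sketch of what such a check could look like, assuming "unexpected" means a record whose first line is not a NAME field. The function check_msp_record and the example records are hypothetical and not part of MsBackendMsp.

```r
# Stop with an informative error if a record does not start with NAME.
check_msp_record <- function(record) {
  record <- record[nzchar(trimws(record))]
  if (length(record) == 0L ||
      !grepl("^NAME:", record[1L], ignore.case = TRUE)) {
    stop("Unexpected MSP format: record does not start with a NAME field",
         call. = FALSE)
  }
  invisible(TRUE)
}

# Toy records (hypothetical).
good <- c("NAME: Spectrum A", "NUM PEAKS: 1", "100.0 999")
bad <- c("INCHIKEY: KEY-B", "NAME: Spectrum B", "NUM PEAKS: 1", "200.0 999")

ok_good <- check_msp_record(good)

# Capture the error message instead of aborting the script.
msg_bad <- tryCatch(check_msp_record(bad),
                    error = function(e) conditionMessage(e))

print(ok_good)
print(msg_bad)
```

Failing early like this would turn silently misassigned metadata into a visible error at import time.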

jorainer · 2023-10-26T06:58:43Z

I have a PR in MsBackendMsp that fixes the issues we have with your MSP files. With that new version it would be possible to properly handle and read your files. PR: rformassspectrometry/MsBackendMsp#15

jorainer added the bug label (Something isn't working) · Oct 23, 2023
