Two different chemical names appear in the file and point to different molecular entities #255

Open
LiuLime opened this issue Feb 22, 2024 · 10 comments


@LiuLime

LiuLime commented Feb 22, 2024

Hello,

While cleaning the MassBank data, I found the following:

(1) Confusing name records

In the datasets provided by the MSBNK-ACES_SU institute, some records look suspicious: a single file lists two chemical names that refer to completely different molecules.
[Screenshot 2024-02-22 12 09 26]

Judging by the exact mass and formula in the example, I believe the first name is correct and the second one was, for some reason, uploaded by mistake.
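A check of this kind can be sketched in Python. The snippet below is a minimal illustration, not MassBank's validation code: `monoisotopic_mass` and `formula_matches_mass` are hypothetical helper names, the element table covers only a few common elements, and the parser handles only flat formulas (no brackets, charges or isotope labels).

```python
import re

# Monoisotopic masses (in u) of the most abundant isotope of a few elements.
# Deliberately incomplete; extend as needed.
MONOISOTOPIC = {
    "H": 1.00782503223,
    "C": 12.0,
    "N": 14.00307400443,
    "O": 15.99491461957,
    "F": 18.99840316273,
    "P": 30.97376199842,
    "S": 31.9720711744,
    "Cl": 34.968852682,
    "Br": 78.9183376,
}


def monoisotopic_mass(formula: str) -> float:
    """Return the monoisotopic mass of a flat formula such as 'C10H19O6PS2'."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element not in MONOISOTOPIC:
            raise ValueError(f"element not in table: {element}")
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total


def formula_matches_mass(formula: str, exact_mass: float, tol: float = 0.001) -> bool:
    """True if the formula's monoisotopic mass is within tol of exact_mass."""
    return abs(monoisotopic_mass(formula) - exact_mass) <= tol
```

Using the commonly cited molecular formulas (not values copied from the records), `monoisotopic_mass("C37H67NO13")` for erythromycin gives about 733.461 and `monoisotopic_mass("C10H19O6PS2")` for malathion gives about 330.036, so the exact mass in a record can agree with only one of the two names.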

Here are a few suspicious records I found (more may exist):

MSBNK-ACES_SU-AS000181
MSBNK-ACES_SU-AS000133
MSBNK-ACES_SU-AS000121
MSBNK-ACES_SU-AS000110
MSBNK-ACES_SU-AS000089
MSBNK-ACES_SU-AS000004
MSBNK-ACES_SU-AS000160
MSBNK-ACES_SU-AS000201

(2) In the MASSBANK_Athens_Univ records, the CAS number may refer to a different form of the molecule than the one the author intended to upload.

[Screenshot 2024-02-22 13 03 46]

The blue arrow points to the search result for the uploaded CAS number, and the red arrow points to what I think is the correct CAS number.

Because CAS numbers are very useful for characterizing molecules precisely, especially for distinguishing isomers (better than InChI or InChIKey, which sometimes cannot distinguish isomers with different conformations), it would be helpful if they were correct :)

Thank you very much!

@schymane
Member

Thank you for the feedback about these records...

@meier-rene I can confirm that it seems MSBNK-Athens_Univ-AU113805 and MSBNK-Athens_Univ-AU113804 should have the CAS number corrected to 20574-50-9. The CAS number appears to be correct in MSBNK-Athens_Univ-AU113801. Do you wish to do this update on your side?

For some reason these records are not appearing in PubChem, I have reported this separately, but the CAS is aligned with the CAS on PubChem where the equivalent MoNA records appear as well.

I will need more time to check the name issue.

@schymane
Member

Regarding the names in the ACESx spectra, some look OK to me, some look like they need fixing. If you disagree with my assessment can you please provide more exact details in your report so we know what exact issue you mean?

Records that you flagged that seem OK to me

  • MSBNK-ACES_SU-AS000004 <= looks OK to me (not ideal but not incorrect either, the second name is just less specific than it could be)
  • MSBNK-ACES_SU-AS000089 <= the second name is the IUPAC name for Carbamazepine - see query
  • MSBNK-ACES_SU-AS000160 <= same as MSBNK-ACES_SU-AS000089, it is also Carbamazepine & its IUPAC name
  • MSBNK-ACES_SU-AS000133 <= the second name Leucoline is associated with several quinoline substance records in PubChem and does not appear to be associated with other records instead, it seems OK to me?
  • MSBNK-ACES_SU-AS000201 <= same as MSBNK-ACES_SU-AS000133, looks fine to me.

Records that you flagged that seem to have errors that need fixing

  • MSBNK-ACES_SU-AS000110 <= indeed this one has a problem, the second name is wrong (it is the IUPAC name of malathion) and should be removed

CH$NAME: Erythromycin
CH$NAME: Diethyl 2-dimethoxyphosphinothioylsulfanylbutanedioate

  • MSBNK-ACES_SU-AS000181 <= same as MSBNK-ACES_SU-AS000110, second name should be removed

  • MSBNK-ACES_SU-AS000121 <= the second name is wrong (it is the IUPAC name of Octocrylene) and should be removed

CH$NAME: Methamidophos
CH$NAME: 2-ethylhexyl 2-cyano-3,3-diphenylprop-2-enoate

@meier-rene how do you wish to coordinate this?

@schymane
Member

Oh and re this:

For some reason these records are not appearing in PubChem, I have reported this separately.

...Jeff found the issue and is reparsing our data, should be fixed soon.

@LiuLime
Author

LiuLime commented Feb 23, 2024

Thank you very much for your quick response. Now I understand part of the rules for uploaded data.

Here is how I did the comparison:
To confirm that the InChI is correct, I compared the InChI stored in the original record with the InChI returned by a database.

The database InChI is obtained by searching with an identifier (the priority is CAS > PubChem CID > ChEBI ID > KEGG ID > name).
If the record's InChI differs from the database InChI, I mark the record as an "inconsistent object".
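The comparison described above could look roughly like the following Python sketch. It is only an illustration of the described procedure, not the actual script: the key names in `PRIORITY`, the functions `pick_identifier` and `check_record`, and the `lookup` callback (which stands in for whatever database query is used) are all hypothetical.

```python
from typing import Callable, Optional

# Identifier kinds in the order described above (hypothetical key names).
PRIORITY = ["CAS", "PUBCHEM_CID", "CHEBI", "KEGG", "NAME"]


def pick_identifier(identifiers: dict) -> Optional[tuple]:
    """Return (kind, value) for the highest-priority identifier present."""
    for kind in PRIORITY:
        value = identifiers.get(kind)
        if value:
            return kind, value
    return None


def check_record(
    record_inchi: str,
    identifiers: dict,
    lookup: Callable[[str, str], Optional[str]],
) -> str:
    """Compare the record's InChI with the InChI found via one identifier."""
    choice = pick_identifier(identifiers)
    if choice is None:
        return "no identifier"
    database_inchi = lookup(*choice)  # lookup(kind, value) -> InChI or None
    if database_inchi is None:
        return "not found"
    if database_inchi == record_inchi:
        return "consistent"
    return "inconsistent object"
```

Because only the single highest-priority identifier is queried, a wrong or outdated CAS number is enough to flag a record even when its structure fields are correct.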

I think the problem has two causes:

    1. In my code, if multiple names appear, the last recorded name is used. This causes the inconsistency in the second case you mentioned (MSBNK-ACES_SU-AS000181 and MSBNK-ACES_SU-AS000121).
    2. My priority for identifier lookups is NIH Chemical Identifier Resolver > PubChem. This explains the first case, where the NIH result differs slightly from the PubChem result.

Please see the results I collated in this Excel file.
ans.xlsx


Next, I will re-analyze the data using only PubChem, instead of NIH, for identifier lookups. Thank you very much for your kind explanation.

@schymane
Member

Hi @LiuLime please note that as a structure-oriented database (since the mass spectra are connected to the structures) our MassBank validation procedures differ slightly from yours. Our highest priority goes to SMILES (the displayed structure comes directly from the SMILES), from there we check mass, formula, InChI and InChIKeys, and the database identifiers are secondary information and often retrieved and provided by contributors in a variety of ways. Some we can validate, but not all. Since CAS numbers are not public and the database requires a license, we cannot validate these automatically from the original source, so these should be taken with caution. Likewise, we are unable to validate ChemSpider identifiers as they do not provide an unlimited API. Often we have for instance people who provide the CAS of the standard (sometimes a salt form), but the structure itself is the neutral molecule as this is what the spectrum corresponds to. Hence the SMILES (and corresponding InChI) should be your reference where possible.
For the NAME field, you should give priority to the first entry, not subsequent ones, as this is also our display name and we give priority to this first field.
Please note we have extensive documentation and the validation code is here. You can also refer to our Record Specification to see the priority we give to entries (compulsory vs optional).
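The name priority described above can be sketched as follows. This is a minimal Python illustration, not MassBank's own parser: `chemical_names` and `display_name` are hypothetical helpers, and the only assumption about the record format is that each name sits on its own `CH$NAME:` line, as in the excerpts quoted earlier in this thread.

```python
from typing import List, Optional

NAME_TAG = "CH$NAME:"


def chemical_names(record_text: str) -> List[str]:
    """Return all CH$NAME values in the order they appear in the record."""
    return [
        line[len(NAME_TAG):].strip()
        for line in record_text.splitlines()
        if line.startswith(NAME_TAG)
    ]


def display_name(record_text: str) -> Optional[str]:
    """Return the first CH$NAME entry, or None if the record has no name."""
    names = chemical_names(record_text)
    return names[0] if names else None


record = (
    "CH$NAME: Erythromycin\n"
    "CH$NAME: Diethyl 2-dimethoxyphosphinothioylsulfanylbutanedioate\n"
)
print(display_name(record))  # Erythromycin
```

Taking `names[-1]` instead of `names[0]` reproduces the "last name wins" behaviour that led to the mismatches reported for MSBNK-ACES_SU-AS000181 and MSBNK-ACES_SU-AS000121.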

As an update, ACES have provided us with some fixes to their records and we are discussing how to implement - thanks for identifying the issues!

@tsufz
Member

tsufz commented Feb 23, 2024

@LiuLime, thanks so much for reporting this issue. Just to add to @schymane's comments: searching by CAS is sometimes tricky, because more than one CAS number can be "true" for the same compound. Some databases provide the currently active CAS number, for example CASfinder (of course) and the US EPA Chemical Dashboard. Others may show other CAS numbers, which are not wrong but may be outdated.

@schymane, this is an interesting topic. Do you know which CAS is provided by a PUG query?

(I cannot check for the given examples as MassBank is down).

@schymane
Member

PubChem cannot return CAS directly, they can only provide CAS numbers that are given as synonyms and these are then depositor contributed (this is one of their examples).
image

The "Full Record Retrieval" does not appear to retrieve headings in the records to me, I guess this would have to be done via annotations somehow.
@PaulThiessen please correct me if I am wrong ...

@PaulThiessen

Yeah it's complicated... Some CAS in PubChem come from depositor-supplied synonyms as Emma mentioned above. And some come from totally separate 3rd party annotation streams, e.g.

https://pubchem.ncbi.nlm.nih.gov/compound/2244#section=CAS

We don't have a way to get CAS-as-synonym specifically out of the list of depositor synonyms (though it's something we're thinking about how to do). You can get it from annotations like this, although it's a structured response...

https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2244/JSON?heading=CAS

Note that PubChem's coverage of CAS is far from complete, and we aren't authoritative - there could be errors or conflicting results.
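As a rough illustration of working with that structured response, the Python sketch below fetches the annotation URL shown above and collects every string that looks like a CAS number. It makes no assumption about the exact JSON layout; it simply walks the whole response. The function names are hypothetical. The check-digit test uses the standard CAS checksum, which catches typos but says nothing about whether a number is assigned to the right compound.

```python
import json
import re
from urllib.request import urlopen

CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(text: str) -> bool:
    """Check CAS syntax and check digit (weighted digit sum modulo 10)."""
    match = CAS_PATTERN.fullmatch(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weights 1, 2, 3, ... are applied from the rightmost digit leftwards.
    checksum = sum(w * int(d) for w, d in enumerate(reversed(digits), start=1))
    return checksum % 10 == int(match.group(3))


def collect_cas(node, found=None) -> list:
    """Recursively collect distinct CAS-like strings from parsed JSON."""
    if found is None:
        found = []
    if isinstance(node, dict):
        for value in node.values():
            collect_cas(value, found)
    elif isinstance(node, list):
        for value in node:
            collect_cas(value, found)
    elif isinstance(node, str) and is_valid_cas(node) and node not in found:
        found.append(node)
    return found


def fetch_cas_annotations(cid: int) -> list:
    """Query the PUG View URL pattern quoted above (needs network access)."""
    url = (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/"
        f"{cid}/JSON?heading=CAS"
    )
    with urlopen(url, timeout=30) as response:
        return collect_cas(json.load(response))
```

A call such as `fetch_cas_annotations(2244)` may return more than one value, since different annotation sources can disagree.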

@LiuLime
Author

LiuLime commented Feb 24, 2024

@schymane
Thank you so much for the explanation and for providing these links; they are helpful.
@tsufz @PaulThiessen
Thank you so much for the discussion about CAS; it clears up my confusion about the CAS numbers in the records.
I had assumed that records with CAS numbers come from commercial standards, whose suppliers would give depositors accurate CAS numbers, so it was quite unexpected to find these unusual examples.

It seems that for MassBank data it is better to use isomeric SMILES instead of these database identifiers.


Another tentative idea: how about using the NCI/CADD Chemical Identifier Resolver to search for CAS numbers?

Results can be obtained by constructing a query URL.
Example
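A query URL of this kind could be built as in the Python sketch below. The URL pattern (`/chemical/structure/<identifier>/<representation>`, with `cas` as the representation) reflects my understanding of the resolver's public URL scheme and should be checked against its documentation; `resolver_url` and `resolve` are hypothetical helper names.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the NCI/CADD Chemical Identifier Resolver.
RESOLVER_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "cas") -> str:
    """Build a resolver query URL, percent-encoding the identifier."""
    return f"{RESOLVER_BASE}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier: str, representation: str = "cas") -> list:
    """Fetch the plain-text answer, one value per line (needs network access).

    urlopen raises urllib.error.HTTPError when the service returns an error
    status, for example when nothing is found for the identifier.
    """
    with urlopen(resolver_url(identifier, representation), timeout=30) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

A single identifier can resolve to several CAS numbers, so the result is returned as a list rather than a single value.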

The potential issue is that NIH results may differ from PubChem in some instances, so it may also make sense to use only PubChem for every identifier query to stay consistent.

May I have your thoughts?

Have a great day!

@schymane
Member

Indeed for MassBank it's better to rely on the structural data, as this is our focus. Our records are provided by a variety of contributors, who each use a variety of different ways to find and supply CAS - some take this from the standards, others from services like PubChem and CACTVS like you point out. One of our workflows, RMassBank, uses a combination of these (see here). Both CACTVS and PubChem are from NIH, but different sections (NCI vs NCBI) and both have the same issue with CAS - the only authoritative source is the CAS Registry which requires a license, which makes it difficult to verify any CAS numbers in a fully open science workflow such as MassBank (PubChem have the same issue). The CAS Common Chemistry set exists but it is only 500K compounds and does not cover all the compounds in MassBank.
It is not clear to me exactly what you wish to do in your workflow, or why you would want to rely on CAS instead of structural identifiers. We give CAS numbers lower priority than other identifiers for the reasons above (but display them because users find them useful), and we collaborate with PubChem and other resources to provide the best alternatives we can (more thoughts on identifiers in DOI: 10.1186/s13321-021-00520-4).
For our integration with PubChem we rely on SMILES, InChI, InChIKeys and thus have PubChem CIDs in PubChem corresponding to all our records (barring a few special cases). See DOI: 10.1039/D3EM00181D and the MassBank EU Data Source.
