Two different chemical names appear in the file and point to different molecular entities #255
Comments
Thank you for the feedback about these records... @meier-rene I can confirm that MSBNK-Athens_Univ-AU113805 and MSBNK-Athens_Univ-AU113804 appear to need the CAS number corrected to 20574-50-9. The CAS number appears to be correct in MSBNK-Athens_Univ-AU113801. Do you wish to make this update on your side? For some reason these records are not appearing in PubChem; I have reported this separately. The CAS is, however, consistent with the CAS on PubChem, where the equivalent MoNA records do appear. I will need more time to check the name issue.
Regarding the names in the ACESx spectra, some look OK to me and some look like they need fixing. If you disagree with my assessment, can you please provide more exact details in your report so we know exactly which issue you mean?
Records that you flagged that seem OK to me
Records that you flagged that seem to have errors that need fixing
@meier-rene how do you wish to coordinate this?
Oh and re this:
...Jeff found the issue and is reparsing our data; it should be fixed soon.
Thank you very much for your quick response. Now I understand the partial rule for uploaded data. Here is how I made the comparison: the database InChI is obtained by searching on an identifier (the priority is CAS > PubChem CID > ChEBI ID > KEGG ID > name). I think the problem is caused by two points:
Please see the results I collated in this Excel file. Next I will re-analyze the data using only PubChem, instead of NIH, to search on the identifier. Thank you very much for your kind explanation.
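The lookup priority described above (CAS > PubChem CID > ChEBI ID > KEGG ID > name) can be sketched as a simple ordered fallback. This is only an illustration of the idea, not the reporter's actual script; the dictionary keys (`cas`, `pubchem_cid`, and so on) are hypothetical field names chosen for the example.

```python
from typing import Optional, Tuple

# Priority order quoted in the comment: CAS > PubChem CID > ChEBI ID > KEGG ID > name.
# The keys below are hypothetical field names, not MassBank record tags.
PRIORITY = ("cas", "pubchem_cid", "chebi_id", "kegg_id", "name")


def pick_identifier(record: dict) -> Optional[Tuple[str, str]]:
    """Return (kind, value) for the highest-priority identifier that is present.

    Empty strings and missing keys are skipped; None is returned when the
    record carries no usable identifier at all.
    """
    for kind in PRIORITY:
        value = str(record.get(kind) or "").strip()
        if value:
            return kind, value
    return None


if __name__ == "__main__":
    # The CAS field is empty here, so the lookup falls back to the PubChem CID.
    print(pick_identifier({"cas": "", "pubchem_cid": 2244, "name": "aspirin"}))
```

Because the first available identifier wins, a wrong or outdated CAS silently overrides every other field in the record, which is one way a correct structure can end up compared against the wrong database entry.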
Hi @LiuLime, please note that as a structure-oriented database (the mass spectra are connected to the structures), our MassBank validation procedures differ slightly from yours. Our highest priority goes to SMILES (the displayed structure comes directly from the SMILES); from there we check mass, formula, InChI and InChIKeys. The database identifiers are secondary information, often retrieved and provided by contributors in a variety of ways. Some we can validate, but not all. Since CAS numbers are not public and the database requires a license, we cannot validate these automatically from the original source, so they should be treated with caution. Likewise, we are unable to validate ChemSpider identifiers as they do not provide an unlimited API. Often, for instance, people provide the CAS of the standard (sometimes a salt form), but the structure itself is the neutral molecule, as this is what the spectrum corresponds to. Hence the SMILES (and corresponding InChI) should be your reference where possible. As an update, ACES have provided us with some fixes to their records and we are discussing how to implement them. Thanks for identifying the issues!
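One small part of a structure-first check, comparing a record's formula against its exact mass, can be sketched with the standard library alone. This is not MassBank's validation code; the SMILES, InChI and InChIKey steps would need a cheminformatics toolkit and are left out. The element table is deliberately short, the parser handles only plain formulas such as `C9H8O4` (no brackets, charges or isotope labels), and the function names and the 0.001 tolerance are assumptions made for the example.

```python
import re

# Monoisotopic masses (most abundant isotope), rounded to six decimals.
# Only a handful of common elements are included in this sketch.
MONOISOTOPIC = {
    "H": 1.007825, "C": 12.000000, "N": 14.003074, "O": 15.994915,
    "F": 18.998403, "P": 30.973762, "S": 31.972071, "Cl": 34.968853,
    "Br": 78.918338,
}

_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")
_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def monoisotopic_mass(formula: str) -> float:
    """Sum monoisotopic masses for a plain formula such as 'C9H8O4'."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    total = 0.0
    for symbol, count in _TOKEN.findall(formula):
        if symbol not in MONOISOTOPIC:
            raise ValueError(f"element not in the sketch table: {symbol}")
        total += MONOISOTOPIC[symbol] * (int(count) if count else 1)
    return total


def mass_consistent(formula: str, reported_mass: float, tol: float = 0.001) -> bool:
    """True when the reported exact mass agrees with the formula within tol."""
    return abs(monoisotopic_mass(formula) - reported_mass) <= tol


if __name__ == "__main__":
    # Aspirin as a generic example; a mismatched mass is flagged as inconsistent.
    print(mass_consistent("C9H8O4", 180.04226))
    print(mass_consistent("C9H8O4", 138.03169))
```

A record whose second name points to a different molecule would normally fail this kind of check against the name-derived formula, which is the same reasoning the original report applied by hand.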
@LiuLime, thanks so much for reporting this issue. Just to add to @schymane's comments: searching on CAS is sometimes tricky, as several different CAS numbers can be "true". Some databases provide the current active CAS number, for example CASfinder (of course) and also the US EPA Chemical Dashboard. Others may show other CAS numbers, which are not wrong, but may be outdated. @schymane, this is an interesting topic. Do you know which CAS is provided by a PUG query? (I cannot check for the given examples as MassBank is down.)
PubChem cannot return CAS directly; they can only provide CAS numbers that are given as synonyms, and these are depositor-contributed (this is one of their examples). The "Full Record Retrieval" does not appear to me to retrieve headings in the records; I guess this would have to be done via annotations somehow.
Yeah it's complicated... Some CAS in PubChem come from depositor-supplied synonyms, as Emma mentioned above. And some come from totally separate 3rd party annotation streams, e.g. https://pubchem.ncbi.nlm.nih.gov/compound/2244#section=CAS We don't have a way to get CAS-as-synonym specifically out of the list of depositor synonyms (though it's something we're thinking about how to do). You can get it from annotations like this, although it's a structured response... https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/2244/JSON?heading=CAS Note that PubChem's coverage of CAS is far from complete, and we aren't authoritative: there could be errors or conflicting results.
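Since the PUG View response is structured, a small walker is needed to pull the CAS strings out of it. The sketch below builds the URL quoted above and collects every `String` value found under a section whose `TOCHeading` is `CAS`. The nesting it relies on (`Section` / `Information` / `Value` / `StringWithMarkup` / `String`) is an assumption about the response layout and should be checked against a real response; the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# URL pattern taken from the comment above; {cid} is the PubChem compound ID.
PUG_VIEW_CAS = (
    "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{cid}/JSON?heading=CAS"
)


def cas_url(cid: int) -> str:
    """Build the PUG View annotation URL for the CAS heading of one compound."""
    return PUG_VIEW_CAS.format(cid=int(cid))


def extract_cas(payload) -> list:
    """Collect 'String' values nested under any section headed 'CAS'.

    Duplicates are dropped while the original order is kept. The key names
    are assumptions about the PUG View JSON layout.
    """
    found = []

    def walk(node, inside_cas):
        if isinstance(node, dict):
            inside = inside_cas or node.get("TOCHeading") == "CAS"
            if inside and isinstance(node.get("String"), str):
                found.append(node["String"])
            for value in node.values():
                walk(value, inside)
        elif isinstance(node, list):
            for item in node:
                walk(item, inside_cas)

    walk(payload, False)
    return list(dict.fromkeys(found))


def fetch_cas(cid: int, timeout: float = 10.0) -> list:
    """Download the annotation for one CID and return the CAS strings found."""
    with urlopen(cas_url(cid), timeout=timeout) as response:
        return extract_cas(json.load(response))
```

Several sources may contribute values for the same compound, so the result is a list rather than a single number; as noted above, conflicting entries are possible and none of them is authoritative.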
@schymane It seems that for MassBank data it is better to use isomeric SMILES instead of these database identifiers. Another, still immature, idea: how about using the NCI/CADD Chemical Identifier Resolver to search for CAS? Results can be obtained by constructing a query URL. The potential issue is that NIH may differ from PubChem in some instances, so it also makes sense to use only PubChem for every identifier query to keep things consistent. May I have your thoughts? Have a great day!
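Constructing such a query URL can be sketched as follows. The pattern `https://cactus.nci.nih.gov/chemical/structure/<identifier>/<representation>` and the `cas` representation are assumptions based on the resolver's public URL scheme; the function names are hypothetical, and the network call is kept separate so the URL builder can be used on its own.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the NCI/CADD Chemical Identifier Resolver.
BASE = "https://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str = "cas") -> str:
    """Build a resolver URL; the identifier is percent-encoded so that
    SMILES characters such as '(', ')' and '=' survive in a URL path."""
    return f"{BASE}/{quote(identifier, safe='')}/{representation}"


def resolve(identifier: str, representation: str = "cas", timeout: float = 10.0) -> list:
    """Query the resolver and return the non-empty lines of its plain-text reply.

    More than one line can come back for a single input, so a list is returned.
    """
    with urlopen(resolver_url(identifier, representation), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(resolver_url("aspirin"))
    print(resolver_url("CC(=O)O", "stdinchikey"))
```

Any CAS obtained this way carries the same caveat as the PubChem route: it is a secondary lookup and may not agree with the CAS Registry.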
Indeed, for MassBank it's better to rely on the structural data, as this is our focus. Our records are provided by a variety of contributors, who each use different ways to find and supply CAS: some take this from the standards, others from services like PubChem and CACTVS, as you point out. One of our workflows, RMassBank, uses a combination of these (see here). Both CACTVS and PubChem are from NIH, but from different sections (NCI vs NCBI), and both have the same issue with CAS: the only authoritative source is the CAS Registry, which requires a license. This makes it difficult to verify any CAS numbers in a fully open science workflow such as MassBank (PubChem have the same issue). The CAS Common Chemistry set exists, but it is only 500K compounds and does not cover all the compounds in MassBank.
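One check that needs no license is the check digit built into every CAS Registry Number: the last digit equals the weighted sum of the preceding digits (rightmost digit times 1, next times 2, and so on) modulo 10. The sketch below applies that rule; the function name is hypothetical. It only catches typing errors and malformed numbers. It cannot tell whether a well-formed CAS belongs to the intended compound, which is the actual problem discussed in this thread.

```python
import re

# CAS Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
_CAS = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def cas_checksum_ok(cas: str) -> bool:
    """Return True when the string is a well-formed CAS number whose
    check digit matches the weighted digit sum modulo 10."""
    match = _CAS.fullmatch(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight 1 for the rightmost digit before the check digit, then 2, 3, ...
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


if __name__ == "__main__":
    # The corrected CAS from this thread passes; a one-digit typo does not.
    print(cas_checksum_ok("20574-50-9"))
    print(cas_checksum_ok("20574-50-8"))
```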
Hello,
While cleaning the MassBank data, I found the following:
(1) Confusing name records
In the datasets provided by the MSBNK-ACES_SU institute, there are suspicious data records. The files give two different chemical names that represent totally different molecules.
Based on the exact mass and formula in the example, I judge that the first name record is correct and that the second one may, for some reason, have been uploaded by mistake.
Here are a few suspicious records I found (more may exist):
(2) In the MASSBANK_Athens_Univ records, the CAS number may indicate a different form from the molecule the author intended to upload.
The blue arrow indicates the search result for the uploaded CAS number, and the red arrow indicates what I think is the correct CAS number.
Because CAS numbers are very useful for characterizing molecules precisely, especially for distinguishing isomers (better than InChI or InChIKey, which sometimes cannot distinguish isomers with different conformations), it would be helpful if they were correct :)
Thank you very much!