Support SureChembl external link URI patterns #14

Open
stain opened this issue Oct 12, 2015 · 8 comments

@stain
Contributor

stain commented Oct 12, 2015

From https://wiki.openphacts.org/index.php/SureChEMBL

Compounds are assigned SureChEMBL identifiers as used in the SureChEMBL interface and download files. Please note these identifiers have no relation to ChEMBL identifiers, but the UniChem system can be used to cross-reference the two. The URIs provided take the following form:

http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064

Please note that SureChEMBL molecules are not yet loaded in the Open PHACTS chemical registry, so cannot currently be retrieved via OCRS IDs.

Targets are identified by HGNC symbols with URIs of the form:

http://rdf.ebi.ac.uk/resource/surechembl/target/FDFT1

Mappings from HGNC symbols to other gene/protein identifiers are available via the IMS through Ensembl linksets.

Diseases are identified by MeSH disease identifiers with URIs of the form:

http://rdf.ebi.ac.uk/resource/surechembl/indication/D009765

Mappings to UMLS and Disease Ontology (DO) are available via DisGeNET link sets in the IMS. It should be noted that not all MeSH identifiers currently map to a disease in DO.

Patents are uniquely identified by patent numbers in a defined format. This should be the patent office code (e.g., EP, WO or US) followed by a hyphen, the patent number (no leading zeros), another hyphen and finally the kind code (e.g., A1, B2). The SureChEMBL interface provides a service to standardise and resolve other formats of patent numbers.
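As a rough illustration of the format described above, here is a minimal Python sketch that assembles an identifier from its three parts. The function name and the exact character classes (two-letter office code, a kind code of one letter plus an optional digit) are assumptions for illustration only; the SureChEMBL standardisation service is the authority on what is actually accepted.

```python
import re

# Assumed shape: two-letter office code, number without leading zeros,
# kind code of one letter plus an optional digit. These character classes
# are a guess based on the examples given (EP, WO, US; A1, B2).
PATENT_ID = re.compile(r"([A-Z]{2})-([1-9][0-9]*)-([A-Z][0-9]?)")


def format_patent_id(office: str, number: str, kind: str) -> str:
    """Join office code, number and kind code with hyphens (hypothetical helper)."""
    stripped = number.lstrip("0")  # the format asks for no leading zeros
    candidate = f"{office.upper()}-{stripped}-{kind.upper()}"
    if not PATENT_ID.fullmatch(candidate):
        raise ValueError(f"not a recognised patent identifier: {candidate!r}")
    return candidate
```

For example, `format_patent_id("ep", "01339685", "a2")` yields `EP-1339685-A2`.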

URIs take the form:

http://rdf.ebi.ac.uk/resource/surechembl/patent/EP-1339685-A2
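The four URI forms quoted above share one base and differ only in the resource type and local identifier. A small sketch of splitting them apart, assuming nothing beyond the example URIs in this comment (the function and constant names are made up for illustration):

```python
from typing import Tuple

# Base and resource types are taken from the example URIs quoted above.
BASE = "http://rdf.ebi.ac.uk/resource/surechembl/"
KINDS = ("molecule", "target", "indication", "patent")


def split_surechembl_uri(uri: str) -> Tuple[str, str]:
    """Return (resource type, local identifier) for a SureChEMBL URI."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a SureChEMBL URI: {uri!r}")
    kind, sep, local_id = uri[len(BASE):].partition("/")
    # Reject unknown types, missing identifiers and extra path segments.
    if kind not in KINDS or not sep or not local_id or "/" in local_id:
        raise ValueError(f"unexpected SureChEMBL URI shape: {uri!r}")
    return kind, local_id
```

With this, `split_surechembl_uri("http://rdf.ebi.ac.uk/resource/surechembl/target/FDFT1")` returns `("target", "FDFT1")`.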

@stain
Contributor Author

stain commented Oct 15, 2015

I think we need:

@stain stain added this to the 2.2 milestone Oct 15, 2015
@stain stain self-assigned this Oct 15, 2015
@stain
Contributor Author

stain commented Oct 15, 2015

Perhaps @agaulton can check what the URI pattern is for identifiers that DON'T match MeSH or HGNC, to ensure they don't accidentally get mapped by the IMS, i.e. that they don't match the MeSH or HGNC regular expressions.

@agaulton

On the mesh identifiers.org URIs:

  • URLs of the form http://identifiers.org/mesh/D009765 don't seem to resolve
  • Identifiers.org don't seem to list this format - they have identifiers.org/mesh.2012/ or identifiers.org/mesh.2013/
  • The MeSH website seems to use a different identifier by default (e.g., 9338 for D009765, which is obesity), but identifiers.org/mesh.2013/9338 doesn't work either
  • I think the correct MeSH URL to redirect to is: http://www.nlm.nih.gov/cgi/mesh/2015/MB_cgi?field=uid&term=D009765
  • The identifiers.org regular expression seems a bit too permissive to be useful
    Maybe I'm missing something... I'll ask Nick Juty...

Anyway, this probably doesn't matter too much if we're only using it as an ID.

In the SureChEMBL dataset, MeSH disease IDs match the pattern: ^D0[0-9]{5}$
MeSH supplementary concept terms match the pattern: ^C5[0-9]{5}$
Custom SciBite disease IDs match the pattern: ^DX[0-9]{5}$

So they could be distinguished, but not by the identifiers.org pattern.
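A minimal sketch of how the three patterns above could be used to tell the ID families apart. The labels and the function name are invented for illustration; only the regular expressions come from this comment.

```python
import re

# Patterns as quoted above; fullmatch() plays the role of the ^...$ anchors.
DISEASE_ID_PATTERNS = [
    ("mesh_disease", re.compile(r"D0[0-9]{5}")),
    ("mesh_supplementary", re.compile(r"C5[0-9]{5}")),
    ("scibite_custom", re.compile(r"DX[0-9]{5}")),
]


def classify_indication_id(identifier: str) -> str:
    """Return a label for the ID family, or 'unknown' if nothing matches."""
    for label, pattern in DISEASE_ID_PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"
```

For instance, `classify_indication_id("D009765")` gives `"mesh_disease"`, while a made-up ID such as `DX00001` gives `"scibite_custom"`.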

@agaulton

On the gene symbols, we have ~10 of these that are not genuine HGNC symbols. These have the form: ^_[A-Za-z0-9]+$
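One way to keep these out of an HGNC mapping step would be to filter on that pattern before building target URIs. A short sketch; the function name and the `_ABC1` symbol in the example are hypothetical:

```python
import re

# Pattern quoted above for the ~10 symbols that are not genuine HGNC symbols.
NON_HGNC_SYMBOL = re.compile(r"_[A-Za-z0-9]+")


def is_non_hgnc_symbol(symbol: str) -> bool:
    """True if the symbol has the leading-underscore form described above."""
    return NON_HGNC_SYMBOL.fullmatch(symbol) is not None


symbols = ["FDFT1", "_ABC1"]  # "_ABC1" is a made-up example
mappable = [s for s in symbols if not is_non_hgnc_symbol(s)]
```

Here `mappable` ends up containing only `FDFT1`.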

@agaulton

On the unichem chembl-surechembl linkset, we already have one as part of the chembl_20 release that should work: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/20.1/chembl_20.1_unichem.ttl.gz

The SureChEMBL URIs in that file are different (it pre-dates the SureChEMBL RDF), so the URI pattern would need mapping in the IMS.

@agaulton

Correction - this unichem data is not currently a link set, but could be converted...

@agaulton

Update - Nick has fixed the issue with identifiers.org so the mesh links now work again. He has also made the regex less permissive: ^(C|D)0\d{5}$ So it should be possible to distinguish the SciBite IDs now.
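A quick way to check the tightened regex against SureChEMBL indication IDs, as a sketch only. The function name is hypothetical and `DX00001` is a made-up SciBite-style ID; the pattern is the one quoted in this comment.

```python
import re

# identifiers.org MeSH pattern as quoted in this comment.
IDENTIFIERS_ORG_MESH = re.compile(r"^(C|D)0\d{5}$")


def matches_identifiers_org_mesh(identifier: str) -> bool:
    """True if the ID would be accepted by the quoted identifiers.org pattern."""
    return IDENTIFIERS_ORG_MESH.match(identifier) is not None
```

`matches_identifiers_org_mesh("D009765")` is true and `matches_identifiers_org_mesh("DX00001")` is false, so the SciBite IDs drop out. Going only by the patterns as written in this thread, a supplementary ID of the `C5...` form would also fail this check; whether that is intended is not stated here.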

@danidi
Contributor

danidi commented Mar 24, 2016

Compound mappings to OCRS are now available, e.g. for http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064. Disease and Gene_ID patterns seem to be missing still.
