NOTE: public access to the neXtProt SPARQL endpoint is scheduled for 2014
The purpose of this document is to describe an original way to build and test the new advanced search engine for neXtProt. neXtProt is an online knowledge platform on human proteins. It is based on a top-down data integration process, materialized in a central SQL engine (PostgreSQL), which integrates a large amount of data provided by independent groups (a bottom-up process). Currently, neXtProt's data cannot be queried easily because an advanced query engine is missing. The nature of bioinformatics data makes this feature difficult to achieve: the data are highly interconnected and hard to normalize without adding needless complexity.

This project proposes a solution for building an advanced query engine based on the use cases provided by our (main) users. We currently have 91 queries that describe all perspectives of the data for the first release. Our first milestone mainly focuses on those queries.
This project helps to build a closed-world RDF schema through iterations and tests. The schema creation is mainly driven by the user queries. It is not concerned with open-world semantic data. It emphasizes understandable SPARQL queries.
For example, all proteins which are located in mitochondrion with an evidence other than HPA and DKFZ-GFP:

    ?proteins :isoform/:localisation ?statement .
    ?statement :in/:childOf term:SL-0173 ; # Mitochondrion
               :withEvidence/:fromXref/:notIn :HPA, :DKFZ-GFP .

Or you can get all regions/domains (e.g. name, positions and terminology) of a protein:

    entry:NX_P06213 :isoform/:region ?region .
    ?region rdfs:comment ?name ; :start ?start ; :end ?end ; :in ?term .

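The two snippets above are graph patterns only. To send them to a triplestore they need `PREFIX` declarations and a `SELECT ... WHERE { }` wrapper. The following is a minimal sketch of such a wrapper in plain Java; the class `QueryWrapper`, its methods and the three `example.org` namespace URIs are hypothetical placeholders, not names taken from this project, so substitute the namespaces used by your neXtProt triples.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: wraps a bare graph pattern into a complete SELECT query.
final class QueryWrapper {

    // ASSUMPTION: placeholder namespaces. Only the rdfs URI is the real W3C one.
    static Map<String, String> defaultPrefixes() {
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("", "http://example.org/rdf#");
        prefixes.put("entry", "http://example.org/rdf/entry/");
        prefixes.put("term", "http://example.org/rdf/terminology/");
        prefixes.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        return prefixes;
    }

    // Emits one PREFIX line per entry, then SELECT DISTINCT <projection> WHERE { <pattern> }.
    static String wrap(Map<String, String> prefixes, String projection, String pattern) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : prefixes.entrySet()) {
            sb.append("PREFIX ").append(e.getKey()).append(": <")
              .append(e.getValue()).append(">\n");
        }
        sb.append("SELECT DISTINCT ").append(projection).append(" WHERE {\n");
        sb.append(pattern.trim()).append("\n}");
        return sb.toString();
    }

    public static void main(String[] args) {
        String pattern =
              "entry:NX_P06213 :isoform/:region ?region .\n"
            + "?region rdfs:comment ?name ; :start ?start ; :end ?end ; :in ?term .";
        System.out.println(wrap(defaultPrefixes(), "?name ?start ?end ?term", pattern));
    }
}
```

Keeping the prefixes in one place lets every use-case query stay as short as the patterns shown above.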
This project also demonstrates how to use and configure a triplestore (open-virtuoso, fuseki) with Jena and spring-mvc. By following the instructions, you should be able to build your own neXtProt mirror.
###Use our public sparql endpoint
- TODO (Oauth and public key)
###Get your own triplestore instance
- install open-virtuoso 7.x (redhat, ubuntu),
- get the neXtProt triples,
- install the virtuoso jena driver (download the Jena2 provider and jdbc4 jars), then register both jars in your local maven repository:

        $ mvn install:install-file -Dfile=virt_jena2.jar -DgroupId=virtuoso.jena2 -DartifactId=virtuoso-jena2 -Dversion=2.10.x -Dpackaging=jar
        $ mvn install:install-file -Dfile=virtjdbc4.jar -DgroupId=virtuoso.jdbc4 -DartifactId=virtuoso-jdbc4 -Dversion=4.0 -Dpackaging=jar

- configure the triplestore endpoint
  - in the file config/main.properties, configure your own virtuoso instance or use the public neXtProt SPARQL endpoint
  - if you don't have a virtuoso instance, uncomment the variable 'sparql.endpoint' in config/main.properties to use the public neXtProt SPARQL endpoint
  - NOTE: public access to the neXtProt SPARQL endpoint is scheduled for 2014

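To illustrate the effect of commenting or uncommenting 'sparql.endpoint', here is a minimal sketch of how such a property could be resolved with a fallback to a local instance. It is not the project's actual configuration code: the class `EndpointConfig`, the method `resolveEndpoint` and both URLs are hypothetical, and only the key name 'sparql.endpoint' comes from this README.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper: picks the SPARQL endpoint from properties text.
final class EndpointConfig {

    // Returns the value of 'sparql.endpoint' if it is set (i.e. uncommented),
    // otherwise the given fallback (e.g. your own virtuoso instance).
    static String resolveEndpoint(String propertiesText, String fallback) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String endpoint = props.getProperty("sparql.endpoint");
        return (endpoint == null || endpoint.trim().isEmpty()) ? fallback : endpoint.trim();
    }

    public static void main(String[] args) {
        // ASSUMPTION: both URLs are placeholders, not real neXtProt addresses.
        String local = "http://localhost:8890/sparql";
        String commented = "#sparql.endpoint=http://example.org/sparql\n";
        String uncommented = "sparql.endpoint=http://example.org/sparql\n";
        System.out.println(resolveEndpoint(commented, local));   // falls back to local
        System.out.println(resolveEndpoint(uncommented, local)); // uses the property
    }
}
```

A line starting with `#` is a comment in the Java properties format, so a commented key simply does not exist when the file is loaded.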
###Test your configuration: run a single test class

    $ mvn -Dtest=Integrity test

###Run all rdf tests

    $ mvn -Dtest=evaletolab.rdf.* test

###Walking the graph
The class SparqlController.java implements basic proxying to the triplestore. With a native Jena2 driver, you can mix, in a single SPARQL query, data from your native datastore and magic properties from Jena ARQ.

    $ mvn jetty:run
    $ ff localhost:6969

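Once the server is running, a client can send a query through the proxy. The sketch below only shows how a request URL could be built following the SPARQL 1.1 Protocol convention of passing the query text in a URL-encoded `query` parameter. The class `SparqlRequest` and the path `/sparql` are hypothetical; check SparqlController.java for the mapping this project actually exposes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a SPARQL-protocol GET URL for a given endpoint.
final class SparqlRequest {

    // Appends the URL-encoded query as the 'query' parameter of the endpoint.
    static String buildGetUrl(String endpoint, String query) {
        try {
            String separator = endpoint.contains("?") ? "&" : "?";
            return endpoint + separator + "query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // ASSUMPTION: the '/sparql' path is a placeholder; the port comes from this README.
        String url = buildGetUrl("http://localhost:6969/sparql",
                                 "SELECT * WHERE { ?s ?p ?o } LIMIT 10");
        System.out.println(url);
    }
}
```

The resulting URL can be pasted into a browser or passed to any HTTP client.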
###Use case for evidence
- Q27 with >=1 glycosylation sites reported in PubMed:X or PubMed:Y
- Q53 which are involved in cell adhesion according to GO with an evidence not IEA and not ISS
- Q57 which are located in mitochondrion with an evidence other than HPA and DKFZ-GFP
- Q63 which have >=1 RRM RNA-binding domain and either no GO "RNA binding" or a GO "RNA binding" with evidence IEA or ISS
###Use case for expression
- QX Proteins that are not highly expressed in liver at embryonic stage
- Q4 highly expressed in brain but not expressed in testis
- Q11 that are expressed in liver and involved in transport
- Q15 with a PDZ domain that interact with at least 1 protein which is expressed in brain
- Q17 with >=1000 amino acids and located in nucleus and expressed in nervous system
- Q20 with >=2 HPA antibodies whose genes are located on chromosome 21 and that are highly expressed at IHC level in heart
- Q50 which are expressed in brain according to IHC but not expressed in brain according to microarray
- Q77 which are expressed in liver according to IHC data but not found in HUPO liver proteome set
- Q83 whose genes are on chromosome N and that are expressed in only a single tissue/organ
- Q89 which are located in nucleus and expressed in brain and only have orthologs/paralogs in primates
###Use case for sequence
- Q3 Proteins with >=2 transmembrane regions
- Q5 Proteins located in mitochondrion and that lack a transit peptide
- Q9 Proteins with 3 disulfide bonds and that are not hormones
- Q13 Proteins with a protein kinase domain but no kinase activity
- Q14 Proteins with 2 SH3 domains and 1 SH2 domain
- Q15 Proteins with a PDZ domain that interact with at least 1 protein which is expressed in brain
- Q16, Q16a, Q16b Proteins with a mature chain <= 100 amino acids which are secreted and do not contain cysteines in the mature chain
- Q18 Proteins that are acetylated and methylated and located in the nucleus
- Q19 Proteins containing a signal sequence followed by an extracellular domain containing a "KRKR" motif
- Q27 Proteins with >=1 glycosylation sites reported in PubMed:X or PubMed:Y
- Q32 Proteins with a coiled coil region and involved in transcription but that do not contain a bZIP domain
- Q34 Proteins with >=1 homeobox domain and with >=1 variant in the homeobox domain(s)
- Q35 Proteins located in the mitochondrion and which are enzymes
- Q38 Proteins with >=1 selenocysteine in their sequence
- Q39 Proteins with >=1 mutagenesis in a position that corresponds to an annotated active site
- Q40 Proteins that are enzymes and with >=1 mutagenesis that "decrease" or "abolish" activity
- Q41 Proteins that are annotated with GO "F" terms prefixed by "Not"
- Q48 Proteins with >=1 variants of the type "C->" (Cys to anything else) that are linked to >=1 disease
- Q49 Proteins with >=1 variants of the types "A->R" or "R->A"
###Use case for general interaction
- Q24 Proteins with >1 reported gold interaction
- Q25 Proteins with >=50 interactors and not involved in a disease
- Q26 Proteins interacting with >=1 protein located in the mitochondrion
###Use case for general annotation
- Q1 Proteins that are phosphorylated and located in the cytoplasm
- Q2 Proteins that are located both in the cytoplasm and in the nucleus
- Q5 Proteins located in mitochondrion and that lack a transit peptide
- Q6 Proteins whose genes are on chromosome 2 and linked with a disease
- Q7 Proteins linked to diseases that are associated with cardiovascular aspects
- Q8 Proteins whose genes are x bp away from the location of the gene of protein Y
- Q22 Proteins with no function annotated
- Q31 Proteins with >=10 "splice" isoforms
- Q30 Proteins whose gene is located on chromosome 2 and that belong to families with >=5 members in the human proteome
- Q32 Proteins with a coiled coil region and involved in transcription but that do not contain a bZIP domain
- Q47 Proteins with a gene name CLDN*
- Q64 Proteins which are enzymes with an incomplete EC number
- Q68 Proteins with protein evidence PE=2 (transcript level)
- Q65 Proteins with >1 catalytic activity
- Q73 Proteins with no domain
###Use case for Xref queries
- Q72 Proteins with a cross-reference to CCDS
- Q107 All proteins with a protein evidence not "At protein level" with an HGNC identifier/xref that includes the regexp "orf"
###Use case for Gene queries
- Q55 which have genes of length >=10000 bp
- Q58 which are located on the genome next to a protein which is involved in spermatogenesis
###Use case for 3Dstructure queries
- Q108 All proteins that have a 3D structure in PDB that overlaps by at least 50 amino acids with an SH3 domain
- Q81 Proteins with >=1 3D structure that are located in the mitochondrion and are linked with a disease
###Use case for Peptide queries
- Q75 Proteins which have been detected in the HUPO liver proteome set but not the HUPO plasma proteome set
- Q109 All proteins that have a peptide that maps partly or fully into a signal sequence
###Use case for PTM queries
- Q10 Proteins that are glycosylated and not located in the membrane
- Q66 Proteins that are cytoplasmic with alternate O-glycosylation or phosphorylation at the same positions
- Q67 Proteins with alternative acetylation or Ubl conjugation (SUMO or Ubiquitin) at the same positions
###Federated queries
- Q95 which are targets of antibiotics (federated query with drugbank)

The project is compatible with the tomcat and jetty maven plugins: use mvn tomcat7:run or mvn jetty:run.
Some sample controllers (for the SPARQL query provider and jena test) for protein expression are also provided.