Advanced SPARQL for nextprot with spring-mvc, jena and virtuoso

NOTE: the public access of nextprot sparql is scheduled for 2014

The purpose of this document is to describe an original way to build and test the new advanced search engine for neXtProt. neXtProt is an online knowledge platform on human proteins. It is based on a top-down data integration process, materialized in a central SQL engine (postgres), and it integrates a large amount of data provided by independent groups (a bottom-up process). Currently, neXtProt's data cannot easily be queried because there is no advanced query engine. The nature of bioinformatics data makes this feature difficult to achieve: the data are highly interconnected and are difficult to normalize without adding needless complexity. This project proposes a solution to build an advanced query engine, based on the use cases provided by our (main) users. We currently have 91 queries that describe all perspectives of the data for the first release. Our first milestone focuses mainly on those queries.

This project will help to build a closed-world RDF schema through iterations and tests. The schema creation focuses mainly on the user queries. It has nothing to do with open-world semantic data; the emphasis is on understandable SPARQL queries.

For example, all proteins that are located in the mitochondrion with evidence other than HPA and DKFZ-GFP:

  ?proteins :isoform/:localisation ?statement.
    ?statement :in/:childOf term:SL-0173 #Mitochondrion ; 
               :withEvidence/:fromXref/:notIn :HPA,:DKFZ-GFP

Or you can get all regions/domains (e.g. name, positions and terminology) of a protein

  entry:NX_P06213 :isoform/:region ?region.
    ?region rdfs:comment ?name;:start ?start;:end ?end;:in ?term

and plot the result
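
The two fragments above show only the graph pattern; a complete query also needs PREFIX declarations and a SELECT clause. The sketch below shows one way to assemble them in Java. The class `QueryBuilder` is hypothetical and the prefix IRIs in the usage note are placeholders, since this README does not state the namespaces used by the neXtProt schema.

```java
import java.util.Map;

/** Hypothetical helper (not part of the project): wraps a graph pattern into a full SELECT query. */
final class QueryBuilder {

    /**
     * @param prefixes prefix name to namespace IRI; use "" for the default ':' prefix
     * @param vars     projected variables, e.g. "?region ?start"
     * @param pattern  graph pattern placed inside WHERE { ... }
     */
    static String select(Map<String, String> prefixes, String vars, String pattern) {
        StringBuilder sb = new StringBuilder();
        // One PREFIX line per entry, in the iteration order of the map.
        for (Map.Entry<String, String> p : prefixes.entrySet()) {
            sb.append("PREFIX ").append(p.getKey()).append(": <")
              .append(p.getValue()).append(">\n");
        }
        sb.append("SELECT ").append(vars).append(" WHERE {\n")
          .append(pattern).append("\n}");
        return sb.toString();
    }
}
```

For instance, with a `LinkedHashMap` that maps `""` to `http://example.org/rdf#` and `entry` to `http://example.org/rdf/entry/` (both placeholder IRIs), calling `QueryBuilder.select(prefixes, "?region", "entry:NX_P06213 :isoform/:region ?region .")` yields a query string that can be sent to the triplestore.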

This project also demonstrates how to use and configure a triplestore (open-virtuoso, fuseki) with Jena and spring-mvc. By following the instructions, you should be able to build your own neXtProt mirror.

###Use our public sparql endpoint

TODO (Oauth and public key)

###Get your own triplestore instance

- install open-virtuoso 7.x (redhat, ubuntu)
- get the neXtProt triples
- install the virtuoso jena driver (download the Jena2 provider and jdbc4 jars):

$ mvn install:install-file -Dfile=virt_jena2.jar -DgroupId=virtuoso.jena2 -DartifactId=virtuoso-jena2 -Dversion=2.10.x -Dpackaging=jar
$ mvn install:install-file -Dfile=virtjdbc4.jar -DgroupId=virtuoso.jdbc4 -DartifactId=virtuoso-jdbc4 -Dversion=4.0 -Dpackaging=jar

Configure the triplestore endpoint
- in the file main.properties, configure your own virtuoso instance or use the public neXtProt SPARQL endpoint
- if you don't have a virtuoso instance, you can use the public neXtProt SPARQL endpoint. To do that, uncomment the variable 'sparql.endpoint' in config/main.properties
- NOTE: public access to the neXtProt SPARQL endpoint is scheduled for 2014
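
To illustrate what uncommenting 'sparql.endpoint' changes, here is a minimal sketch that reads the key with `java.util.Properties`. It is not the project's actual configuration code: the class `EndpointConfig` is hypothetical, and the fallback URL is an assumption (a local Virtuoso instance on its usual port), not a value taken from this repository.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: resolves the SPARQL endpoint from the text of main.properties. */
final class EndpointConfig {

    // Assumed fallback for a local Virtuoso instance; adjust to your own setup.
    static final String FALLBACK = "http://localhost:8890/sparql";

    static String resolveEndpoint(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        // A line starting with '#' is a comment, so a commented-out key is simply absent
        // and the fallback is returned.
        return props.getProperty("sparql.endpoint", FALLBACK).trim();
    }
}
```

While the line stays commented out (`#sparql.endpoint=...`), the lookup falls back to the local instance; once it is uncommented, the configured endpoint is returned instead.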

###Test your configuration: run a single test class

$ mvn -Dtest=Integrity test

###Run all rdf tests

View all SPARQL tests

$ mvn -Dtest=evaletolab.rdf.* test

###Walking the graph

The class SparqlController.java implements the basic proxying to the triplestore. With a native Jena2 driver, you can mix, in a single SPARQL query, data from your native datastore and magic properties from Jena ARQ.

$ mvn jetty:run
$ ff localhost:6969
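
A proxy of this kind forwards the query text to the triplestore. The project does this through the native Jena driver; the sketch below shows a different, simpler route, the standard SPARQL protocol over HTTP, where the query is passed URL-encoded in the `query` parameter of a GET request. The class `SparqlRequest` is hypothetical, and the endpoint passed in is whatever your own setup exposes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper: builds a SPARQL protocol GET URL for a given endpoint. */
final class SparqlRequest {

    static String queryUrl(String endpoint, String sparql) {
        try {
            // Append to an existing query string if the endpoint already has one.
            String separator = endpoint.contains("?") ? "&" : "?";
            // Form-style encoding: spaces become '+', reserved characters become %XX.
            return endpoint + separator + "query=" + URLEncoder.encode(sparql, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

The resulting URL can be opened in a browser or fetched with any HTTP client to check that the endpoint answers before wiring it into the controller.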

###Use case for evidence

Q27 with >=1 glycosylation sites reported in PubMed:X or PubMed:Y
Q53 which are involved in cell adhesion according to GO with an evidence not IEA and not ISS
Q57 which are located in mitochondrion with an evidence other than HPA and DKFZ-GFP
Q63 which have >=1 RRM RNA-binding domain and either no GO "RNA binding" or a GO "RNA binding" with evidence IEA or ISS

###Use case for expression

QX Proteins that are not highly expressed in liver at embryonic stage
Q4 highly expressed in brain but not expressed in testis
Q11 that are expressed in liver and involved in transport
Q15 with a PDZ domain that interact with at least 1 protein which is expressed in brain
Q17 >=1000 amino acids and located in nucleus and expression in nervous system
Q20 with >=2 HPA antibodies whose genes are located on chromosome 21 and that are highly expressed at IHC level in heart
Q50 which are expressed in brain according to IHC but not expressed in brain according to microarray
Q77 which are expressed in liver according to IHC data but not found in HUPO liver proteome set
Q83 whose genes are on chromosome N that are expressed in only a single tissue/organ
Q89 which are located in nucleus and expressed in brain and only have orthologs/paralogs in primates

###Use case for sequence

Q3 Proteins with >=2 transmembrane regions
Q5 Proteins located in mitochondrion and that lack a transit peptide
Q9 Proteins with 3 disulfide bonds and that are not hormones
Q13 Proteins with a protein kinase domain but no kinase activity
Q14 Proteins with 2 SH3 domains and 1 SH2 domain
Q15 Proteins with a PDZ domain that interact with at least 1 protein which is expressed in brain
Q16, Q16a, Q16b Proteins with a mature chain <= 100 amino acids which are secreted and do not contain cysteines in the mature chain
Q18 Proteins that are acetylated and methylated and located in the nucleus
Q19 Proteins containing a signal sequence followed by an extracellular domain containing a "KRKR" motif
Q27 Proteins with >=1 glycosylation sites reported in PubMed:X or PubMed:Y
Q32 Proteins with a coiled coil region and involved in transcription but that do not contain a bZIP domain
Q34 Proteins with >=1 homeobox domain and with >=1 variant in the homeobox domain(s)
Q35 Proteins located in the mitochondrion and that are enzymes
Q38 Proteins with >=1 selenocysteine in their sequence
Q39 Proteins with >=1 mutagenesis in a position that corresponds to an annotated active site
Q40 Proteins that are enzymes and with >=1 mutagenesis that "decrease" or "abolish" activity
Q41 Proteins that are annotated with GO "F" terms prefixed by "Not"
Q48 Proteins with >=1 variants of the type "C->" (Cys to anything else) that are linked to >=1 disease
Q49 Proteins with >=1 variants of the types "A->R" or "R->A"

###Use case for general interaction

Q24 Proteins with >1 reported gold interaction
Q25 Proteins with >=50 interactors and not involved in a disease
Q26 Proteins interacting with >=1 protein located in the mitochondrion

###Use case for general annotation

Q1 Proteins that are phosphorylated and located in the cytoplasm
Q2 Proteins that are located both in the cytoplasm and in the nucleus
Q5 Proteins located in mitochondrion and that lack a transit peptide
Q6 Proteins whose genes are on chromosome 2 and linked with a disease
Q7 Proteins linked to diseases that are associated with cardiovascular aspects
Q8 Proteins whose genes are x bp away from the location of the gene of protein Y
Q22 Proteins with no function annotated
Q31 Proteins with >=10 "splice" isoforms
Q30 Proteins whose gene is located on chromosome 2 and that belong to families with >=5 members in the human proteome
Q32 Proteins with a coiled coil region and involved in transcription but that do not contain a bZIP domain
Q47 Proteins with a gene name CLDN*
Q64 Proteins which are enzymes with an incomplete EC number
Q68 Proteins with protein evidence PE=2 (transcript level)
Q65 Proteins with >1 catalytic activity
Q73 Proteins with no domain

###Use case for Xref queries

Q72 Proteins with a cross-reference to CCDS
Q107 All proteins with a protein evidence not "At protein level" with a HGNC identifier/xref that includes the regexp "orf"

###Use case for Gene queries

Q55 which have genes of length >=10000 bp
Q58 which are located on the genome next to a protein which is involved in spermatogenesis

###Use case for 3Dstructure queries

Q108 All proteins that have a 3D structure in PDB that overlap by at least 50 amino acids with a SH3 domain.
Q81 Proteins with >=1 3D structure and are located in the mitochondrion and are linked with a disease

###Use case for Peptide queries

Q75 Proteins which have been detected in the HUPO liver proteome set but not the HUPO plasma proteome set
Q109 All proteins that have a peptide that maps partly or fully into a signal sequence

###Use case for PTM queries

Q10 Proteins that are glycosylated and not located in the membrane
Q66 Proteins that are cytoplasmic with alternate O-glycosylation or phosphorylation at the same positions
Q67 Proteins with alternative acetylation or Ubl conjugation (SUMO or Ubiquitin) at the same positions

###Federated queries

Q95 which are targets of antibiotics (federated query with DrugBank)

The project is compatible with the tomcat and jetty maven plugins.

Use mvn tomcat7:run or mvn jetty:run

Some sample controllers for protein expression (a SPARQL query provider and a Jena test) are also provided.
