Skip to content

R multiple sequence alignment tool for conservation analysis of proteins listed in uniprot

License

Notifications You must be signed in to change notification settings

d0minicO/unicoRn

Repository files navigation

UnicoRn

R multiple sequence alignment tool for protein conservation analysis

UnicoRn is a custom R function that takes gene names and outputs a pdf for each gene showing a pretty alignment of the amino acid sequence for that gene across multiple predefined species (example output)


Install

Part 1 -- install package dependencies if you have not already got them

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("biomaRt")
BiocManager::install("msa")
BiocManager::install("Biostrings")
install.packages("seqinr")
install.packages("tidyverse")
install.packages("data.table")
install.packages("tools")
install.packages("devtools")
install.packages("tinytex")

Check you can load all these libraries successfully. If you see no errors then you are good to go!

library(msa)
library(Biostrings)
library(biomaRt)
library(seqinr)
library(tidyverse)
library(data.table)
library(tinytex)
library(tools)
library(devtools)

Part 2 -- source unicoRn from github

devtools::source_url("https://github.com/d0minicO/unicoRn/blob/main/unicoRn.R?raw=TRUE")

Arguments

  • base = route / working directory (eg. "C:/user/baloons/")
  • subs name = name of this list of genes to save in their own folder (eg. "UnicoRn_analysis")
  • genes = character string or vector eg "Gene", or c("Gene1", "Gene2")
  • len (optional) = the length of the first how many AAs to plot (or if you just want the whole length make len="whole" or any characters will do) --- default is full length
  • speciesToUse (optional) = character string of the species to use -- these must be formatted in the way that ensembl recognises them and separated with a boolean or (eg. "hsapiens|mmusculus|clfamiliaris") will get sequences for human, mouse, and dog --- default is to look for human, mouse, dog, opossum, chicken, frog, zebrafish, xenopus, and drosophila sequences. Will only use species where a gene/sequence was found
  • returnData (optional) = TRUE or FALSE. If TRUE, then return a data.frame (tibble) of the uniprot IDs, species, and sequences. If false, then perform alignment and plotting. Returning the dataframe can be useful for other downstream purposes, or just to match a long list of genes to uniprot IDs. --- default is not to return a data frame but to perform alignment of each gene and plotting

Example usage

Basic usage

unicoRn(base="C:/user/baloon/unicoRn_Analysis/",
        subs_name="UnicoRn_example",
        genes=c("DDX50","DDX21"))

Output

Example image

See https://github.com/d0minicO/unicoRn/blob/main/example_output_files/ for examples of the .fasta alignment, .tex file, and alignment.pdf


How does it work?

UnicoRn uses the powerful biomaRt package (https://bioconductor.org/packages/release/bioc/html/biomaRt.html) to locate gene names across species and link them to a uniprot ID. Then, the curated uniprot database (https://www.uniprot.org/) is queried to extract the amino acid sequences for those proteins. The package msa (https://bioconductor.org/packages/release/bioc/html/msa.html) is used to generate a multiple sequence alignment. The alignment is saved (.fasta format) and a custom .tex file is created utilising the texshade package (https://www.ctan.org/pkg/texshade) inspired by this example (https://www.overleaf.com/latex/templates/standalone-msa-figure/rbgrxrmctccc). Finally, the tinytex implimentation of latex (https://www.rdocumentation.org/packages/tinytex/versions/0.32) is used to compile and save a pdf of the alignment.


Limitations

A limitation of this function is that it only takes the canonical sequence for each Uniprot entry. This means that if you know you require a specific protein isoform, the canonical sequence may not be suitable. In many cases, however, the canonical sequence is acceptable. It is defined by uniprot as follows:

"To reduce redundancy, the UniProtKB/Swiss-Prot policy is to describe all the protein products encoded by one gene in a given species in a single entry. We choose for each entry a canonical sequence based on at least one of the following criteria:

  • It is the most prevalent.
  • It is the most similar to orthologous sequences found in other species.
  • By virtue of its length or amino acid composition, it allows the clearest description of domains, isoforms, genetic variation, post-translational modifications, etc.
  • In the absence of any information, we choose the longest sequence."

Troubleshooting

dominic.owens ..at?? utoronto.ca

Tinytex

Latex must be installed and communicating properly with R in order to compile and save the pdf. Tinytex (https://www.rdocumentation.org/packages/tinytex/versions/0.32) is a fairly straightforward way to get Rstudio and latex to communicate. If the pdf output step fails then it is likely that Rstudio is not able to communicate properly with tinytex/latex. You can try testing out the following code to see where the problem lies

## See if tinytex can compile this basic test pdf
writeLines(c(
   '\\documentclass{article}',
   '\\begin{document}',
   "Help, get it to work!",
   '\\end{document}'
 ), "test.tex")
tinytex::pdflatex("test.tex")

# now test if tinytex can properly use texshade
## first make an example alignment file
writeLines(c(
  ">DDX21_Human",
  "MPGKLRSDAG--",
  ">DDX21_Mouse",
  "MPGKLRSGAK--",
  ">DDX21_Dog",
  "MPGKLLSDAG--",
  ">DDX21_Zebrafish",
  "--EKWQDSRRWT"),
  "example.fasta")

## then write a minimal .tex file
writeLines(c(
  '\\documentclass{article}',
  "\\usepackage{texshade}",
  '\\begin{document}',
  "\\begin{texshade}",
  "{example.fasta}",
  "\\end{texshade}",
  '\\end{document}'
), "example.tex")

# test out compiling
tinytex::pdflatex("example.tex")

## if this doesn't work try this method to compile instead
## This is what unicoRn actually uses as tinytex::pdflatex has been buggy for me
tools::texi2pdf("example.tex", clean=TRUE)

If these didn't both work, then you will have to try make sure tinytex and texshade are properly installed

# reinstall the tinytex package
install.packages("tinytex")
library(tinytex)

# install the full version of the package
tinytex:::install_prebuilt('TinyTeX')

# list installed packages available to tinytex
# check that "texshade" is listed
tl_pkgs()

# install a lot of the additional common packages
tinytex:::install_yihui_pkgs()

# force installation of texshade
tlmgr_install("texshade")

# If having problems use this to try to get more informative error messages
options(tinytex.verbose = TRUE)

## Alternative way to install tinytex with texshade
install_tinytex(
  force = FALSE,
  dir = "auto",
  version = "daily",
  repository = "ctan",
  extra_packages = "texshade",
  add_path = TRUE
)

For more help with tinytex/latex issues check out the developer's debugging instructions (https://yihui.org/tinytex/r/#debugging)

Deleted Uniprot IDs

If you see a warning like "Uniprot sequence not found for id. Ignoring this id." This means that one of the IDs you are trying to pull from Uniprot has been deleted recently.

It is annoying but Uniprot is a curated database meaning that they periodically delete old and/or spurious IDs. It is not trivial to solve this and the easiest thing to do for now is just to ignore that ID for the alignment, with the expectation that other species IDs will be found.

If you want to align specific IDs then you could try my other tool called alignR, that allows you to align any Uniprot IDs or custom sequences in the same way UnicoRn does.

About

R multiple sequence alignment tool for conservation analysis of proteins listed in uniprot

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published