The quickest way to start using Epitopedia is to download the Docker container, which has all the dependencies preinstalled:
git clone https://github.com/cbalbin-bio/Epitopedia.git
docker pull cbalbin/epitopedia
Epitopedia requires the PDB in mmCIF format, EpitopediaDB and EPI-SEQ DB. EpitopediaDB and EPI-SEQ DB can be downloaded here.
To download the entirety of PDB in mmCIF format:
rsync -rlpt -v -z --delete --port=33444 \
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./mmCIF
OR
To download only the PDB files present in EpitopediaDB (EPI-PDB), you can supply the pdb_id_list.txt to rsync:
rsync -rlpt -v -z --delete --port=33444 --include-from=/path/to/pdb_id_list.txt \
rsync.rcsb.org::ftp_data/structures/divided/mmCIF/ ./mmCIF
To run Epitopedia provide the paths to the various directories discussed below.
The data directory should contain EpitopediaDB (epitopedia.sqlite3) and EPI-SEQ DB (EPI-SEQ.fasta*), which can be downloaded here.
The mmcif directory should point to the sharded PDB directory in mmCIF format as downloaded above.
NOTE: you may need to unzip the mmCIF directory:
gunzip -r mmCIF
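As a quick sanity check on the mmcif directory, the sketch below computes where an entry is expected to sit in the sharded layout and reports whether the file is there. It assumes the standard RCSB "divided" convention, in which entries are grouped into subdirectories named after the two middle characters of the PDB ID. The helper name `divided_mmcif_path` is hypothetical and not part of Epitopedia.

```python
from pathlib import Path


def divided_mmcif_path(mmcif_root, pdb_id, gzipped=False):
    """Return the expected location of an entry in a 'divided' mmCIF mirror.

    Assumes the RCSB divided layout: entries are grouped by the two middle
    characters of the four-character PDB ID, e.g. 6vxx -> vx/6vxx.cif(.gz).
    """
    pdb_id = pdb_id.lower()
    if len(pdb_id) != 4:
        raise ValueError(f"expected a four-character PDB ID, got {pdb_id!r}")
    suffix = ".cif.gz" if gzipped else ".cif"
    return Path(mmcif_root) / pdb_id[1:3] / f"{pdb_id}{suffix}"


if __name__ == "__main__":
    # Check the unzipped file first, then the gzipped one.
    for gz in (False, True):
        path = divided_mmcif_path("./mmCIF", "6VXX", gzipped=gz)
        print(path, "found" if path.exists() else "missing")
```

If only the `.cif.gz` file is reported as found, the directory has probably not been unzipped yet.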
The output directory is where the output files will be written.
Replace the paths on the left side of the colon with the actual absolute path on your local system. The paths on the right side of the colon are internal and should not be altered.
python3 Epitopedia/docker/run_epitopedia.py \
/Path/to/Output/Dir/ \
/Path/to/PDB/Dir/ \
/Path/to/Data/Dir/ \
--afdb-dir /Path/to/AlphaFold/Dir/ \
--taxid-filter 11118 --PDB-IDS 6VXX_A
NOTE: on some systems you may need to run docker with sudo.
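For scripted runs, the command above can be assembled programmatically. The following is a minimal sketch, not part of Epitopedia: `build_run_command` is a hypothetical helper, and the argument order (output, PDB, data) and flag spellings are copied from the example invocation above rather than from the script's own help text.

```python
import shlex


def build_run_command(output_dir, mmcif_dir, data_dir, pdb_ids,
                      taxid_filter=None, afdb_dir=None):
    """Assemble the run_epitopedia.py invocation as an argument list.

    Positional order and flag names mirror the README example; verify them
    against the script's help output before relying on this.
    """
    cmd = [
        "python3", "Epitopedia/docker/run_epitopedia.py",
        output_dir, mmcif_dir, data_dir,
    ]
    if afdb_dir is not None:
        cmd += ["--afdb-dir", afdb_dir]
    if taxid_filter is not None:
        cmd += ["--taxid-filter", str(taxid_filter)]
    cmd += ["--PDB-IDS", *pdb_ids]
    return cmd


if __name__ == "__main__":
    cmd = build_run_command(
        "/Path/to/Output/Dir/", "/Path/to/PDB/Dir/", "/Path/to/Data/Dir/",
        ["6VXX_A"], taxid_filter=11118,
    )
    # Print a copy-pasteable shell command instead of executing it.
    print(shlex.join(cmd))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would execute it. Printing first makes it easy to confirm the paths.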
It is recommended to use the taxid_filter flag to prevent the input protein from finding itself or other versions of itself. For example, to find mimics of the SARS-CoV-2 spike protein (6VXX), we could use a taxid_filter of 11118 to exclude hits from SARS-CoV-2 itself and from other Coronaviridae. The NCBI Taxonomy Browser will be helpful in determining what taxid to specify.
Epitopedia can run on multiple input structures to represent a conformational ensemble. To do so, simply provide a list of structures in the format PDBID_CHAINID as shown below.
run_epitopedia.py --PDB-IDS 6VXX_A 6VXX_B 6XR8_A 6XR8_B
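Before launching a long run, it can help to validate the structure identifiers. The sketch below splits each PDBID_CHAINID token and groups chains by entry. `parse_structure_id` and `group_chains` are hypothetical helpers written for illustration, and the four-character PDB ID check is an assumption about the accepted input.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_structure_id(token: str) -> Tuple[str, str]:
    """Split a PDBID_CHAINID token such as '6VXX_A' into ('6VXX', 'A')."""
    pdb_id, sep, chain_id = token.partition("_")
    if not sep or len(pdb_id) != 4 or not chain_id:
        raise ValueError(f"expected PDBID_CHAINID, got {token!r}")
    return pdb_id.upper(), chain_id


def group_chains(tokens: Iterable[str]) -> Dict[str, List[str]]:
    """Group chain IDs by PDB entry, preserving input order."""
    grouped = defaultdict(list)
    for token in tokens:
        pdb_id, chain_id = parse_structure_id(token)
        grouped[pdb_id].append(chain_id)
    return dict(grouped)


if __name__ == "__main__":
    print(group_chains(["6VXX_A", "6VXX_B", "6XR8_A", "6XR8_B"]))
```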
Epitopedia defaults to a span length of 5, a surface accessibility cutoff of 20%, a surface accessibility span length of 3, and no taxa filter, but these parameters can be set using the following flags:
Flag | Description |
---|---|
--span | Minimum span length for a hit to progress |
--rasa | Cutoff for relative accessible surface area |
--rasa_span | Minimum consecutive accessible residues to consider a hit a SeqBMM |
--taxid_filter | Taxa filter; for example, --taxid_filter 11118 filters out all Coronaviridae |
--rmsd | Max RMSD to still be considered a structural mimic |
--view | View results from a previous run |
--port | Port to be used by webserver |
--use-afdb | Include AFDB in search |
--pplddt | Minimum protein pLDDT score a structure predicted by AlphaFold must have to be considered |
--mplddt | Minimum average local pLDDT score a region predicted by AlphaFold must have to be considered |
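To make the interaction between --rasa and --rasa_span concrete, here is an illustrative sketch of an accessibility check: a span passes if it contains at least rasa_span consecutive residues whose relative accessible surface area meets the cutoff. This is not Epitopedia's implementation. The function names are hypothetical, RASA values are assumed to be fractions between 0 and 1, and treating the cutoff as inclusive is also an assumption.

```python
def longest_accessible_run(rasa_values, rasa_cutoff=0.20):
    """Length of the longest run of consecutive residues at or above the cutoff."""
    longest = current = 0
    for value in rasa_values:
        if value >= rasa_cutoff:  # inclusive comparison is an assumption
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def passes_accessibility(rasa_values, rasa_cutoff=0.20, rasa_span=3):
    """True if the span holds at least `rasa_span` consecutive accessible residues."""
    return longest_accessible_run(rasa_values, rasa_cutoff) >= rasa_span


if __name__ == "__main__":
    # Made-up per-residue RASA values for two five-residue spans.
    print(passes_accessibility([0.05, 0.30, 0.45, 0.25, 0.10]))  # three in a row
    print(passes_accessibility([0.30, 0.10, 0.40, 0.50, 0.05]))  # at most two in a row
```

With the defaults (cutoff 20%, span 3), the first span passes and the second does not.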
Example output files for 6VXX_A, run with a taxid_filter of 11118, can be found here.
Definitions for the output file headers can be found here.
Epitopedia will output the following files at various stages of execution:
File Name | Description |
---|---|
EPI_SEQ_hits_{pdb_id(s)}.tsv | Contains the raw results from the BLAST search of the input structure against EPI-SEQ |
EPI_SEQ_span_filt_hits_{pdb_id(s)}.tsv | Contains hits with consecutive spans that meet the set minimum span length |
EPI_SEQ_span_filt_acc_hits_{pdb_id(s)}.tsv | Contains the above spans that contain the minimum span of accessible residues |
EPI_PDB_hits_{pdb_id(s)}.tsv | Contains hits from searching the epitope source sequences against EPI-PDB |
EPI_PDB_fragment_pairs_{pdb_id(s)}.tsv | Contains structurally aligned fragment pairs consisting of spans of the input structure aligned against the structural representatives |
EPI_PDB_fragment_pairs_{pdb_id(s)}_ranked.tsv | Contains the above but ranked from best to worst RMSD |
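The TSV outputs can be inspected with Python's standard library. The sketch below reads any of the files listed above and returns the column names plus the first few rows. It assumes the files carry a header row, makes no assumption about the actual column names, and `read_tsv` is a hypothetical helper rather than part of Epitopedia.

```python
import csv


def read_tsv(path, limit=None):
    """Return (column_names, rows) from a tab-separated file with a header row.

    Each row is a dict keyed by column name. `limit` caps the number of rows.
    """
    rows = []
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for index, row in enumerate(reader):
            if limit is not None and index >= limit:
                break
            rows.append(row)
        columns = reader.fieldnames  # None if the file is empty
    return columns, rows


if __name__ == "__main__":
    import sys

    # Usage: python read_tsv.py /Path/to/Output/Dir/<one of the .tsv files>
    columns, rows = read_tsv(sys.argv[1], limit=5)
    print(columns)
    for row in rows:
        print(row)
```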
At the final stage of execution, Epitopedia will show the best hit per epitope motif if there are redundant source sequences. These results can be viewed in a tsv file (Example) or a more legible HTML file (Example).
Epitopedia uses IEDB and PDB to generate EpitopediaDB, which is used in the molecular mimicry search.
Generation of the database takes some time (~12 hours). Thus, the EpitopediaDB is provided above.
To create the EpitopediaDB, download IEDB and a mmCIF version of PDB.
Point the container to the appropriate paths for the IEDB, the PDB (mmCIF format), and a data directory where the databases will be written.
docker run --rm -it \
-v /Path/To/iedb_public.sql:/app/iedb \
-v /Path/to/mmCIF/Dir/:/app/mmcif \
-v /Path/to/Data/Dir/:/app/data \
cbalbin/epitopedia generate_database.py
This software is released under the MIT License.
Software and databases used in Epitopedia may be released under various licenses:
Software:
- DSSP
- TM-align
- NCBI BLAST
- Entrez
- MMseqs2
- mysql2sqlite
- sqlite
- Docker
- Flask
- gemmi
- rich
- biopython
- dataclasses-json
- python
Databases:
If you use Epitopedia in your work, please cite:
Epitopedia: identifying molecular mimicry of known immune epitopes
Christian Andrew Balbin, Janelle Nunez-Castilla, Jessica Siltberg-Liberles
bioRxiv 2021.08.26.457577; doi: https://doi.org/10.1101/2021.08.26.457577