This pipeline contains 4 steps. Step 1 and 2 only need to be done once for each genome; and you may get these results from other rescources. I will provide you all files needed for step 3 (identify SDR genes) and 4 (identify genes with high Fst). So you can jumpt directly to step 3.
Identify target genome's (S. purpurea pacbio annotaion v5.1/P. trichocharpa pacbio annotation v1.1) best bastp hit in Arabidopsis database (Araport11 annotation). This is a "quick-and-dirty" way to identify "ortholog". But in our case I think it works for the purpose.
$ makeblastdb -in Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa -dbtype prot -out Araport11.db
$ blastp -db Araport11.db -query Spurpurea94006v5.1.primaryTrs.pep.fa -out Spur_v5.1_to_Ara11.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1
$ blastp -db Araport11.db -query Ptrichocarpavar145v1.1b.primaryTrs.pep.fa -out Ptri_v1.1_to_Ara11.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1
I find these information from "Araport11_genes.201606.pep.fasta" from Tair website. Two things needs to be done: 1) extract the primary transcripts; 2) extract their annotation. There might be better ways to find these information. You may find a file contain these information already. Then you can skip this step.
perl extract_At_anno.pl
Code of the extract_At_anno.pl
#!/usr/bin/perl
%hash = ();
@seqs = ();
open FH, "<Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa";
while (<FH>)
{
if (/^>(\S+)/)
{
$name = $1;
$hash{$name} = 0;
}
}
close FH;
open FH, "<Araport11_genes.201606.pep.fasta";
open OUT, ">Araport11_gene_anno.txt";
while (<FH>)
{
$seq = $_;
chomp $seq;
if (/^>(\S+)/)
{
$gene = $1;
if (exists ($hash{$gene}))
{
@seqs = split(/\|/, $seq);
$seqs[0] =~ s/>//;
print OUT "$seqs[0]\t$seqs[1]\n";
}
}
}
close FH;
close OUT;
Four input files needed:
- SDR file in the format of a single line of "CHROMOSOME START_POSITION END_POSITION".
eg. Snig_SDR.txt <-- generated by SDR mapping pipeline
Chr07 4880000 6860000
- GFF file of the target genome.
eg. Spurpurea94006v5.1.gene.gff3 <-- download from genome annotation
##gff-version 3
##annot-version v5.1
##species Salix purpurea 94006
Chr01 JGI gene 8313 12196 . + . ID=Sa940v51000005m.g;Name=Sa940v51000005m.g
Chr01 JGI mRNA 8328 10828 . + . ID=PAC4GC:37960894;Name=Sa940v51000005m;longest=1;Parent=Sa940v51000005m.g
Chr01 JGI five_prime_UTR 8328 8481 . + . ID=PAC4GC:37960894.five_prime_UTR.1;Parent=PAC4GC:37960894
Chr01 JGI CDS 8482 8511 . + 0 ID=PAC4GC:37960894.CDS.1;Parent=PAC4GC:37960894
Chr01 JGI CDS 8705 8771 . + 0 ID=PAC4GC:37960894.CDS.2;Parent=PAC4GC:37960894
Chr01 JGI CDS 9974 10069 . + 2 ID=PAC4GC:37960894.CDS.3;Parent=PAC4GC:37960894
Chr01 JGI CDS 10164 10279 . + 2 ID=PAC4GC:37960894.CDS.4;Parent=PAC4GC:37960894
- Ortholog file in the format of for each line the first two columns are "Target_Genome_Gene_Name Arabidopsis_Ortholog_Gene_Name".
eg. Spur_b5.1_to_Ara11.txt <-- generated by step 1
Sa940v51059915m AT2G17930.1 84.934 677 102 0 1 677 1341 2017 0.0 1152
Sa940v51061191m AT3G24880.1 68.182 110 35 0 3 112 535 644 5.62e-49 169
Sa940v51061190m AT3G24880.1 58.515 229 94 1 1 229 712 939 2.74e-81 265
Sa940v51061192m AT3G60860.1 77.802 464 103 0 15 478 398 861 0.0 751
Sa940v51061189m AT3G24870.1 64.474 76 27 0 2 77 990 1065 5.95e-26 99.4
Sa940v51061138m AT5G19820.1 47.619 84 44 0 22 105 988 1071 5.18e-22 89.7
- Arabidopsis gene annotation in the format of each line start with the gene name followed by description.
eg. Araport11_gene_anno.txt <-- generated by step 2
AT1G01010.1 NAC domain containing protein 1
AT1G01020.1 ARV1 family protein
AT1G01030.1 AP2/B3-like transcriptional factor family protein
AT1G01040.2 dicer-like 1
AT1G01050.1 pyrophosphorylase 1
To run the job:
perl extract_gene_SDG_anno_willow.pl Snig_SDR.txt
Output file "Willow_SDR_gene_anno_output.txt" contains the following:
Locus Gene At_ortholog Annotation
Chr07:4885351_4889304 Sa940v51022368m AT4G36050.2 endonuclease/exonuclease/phosphatase family protein
Chr07:4898456_4905398 Sa940v51022369m AT5G66210.1 calcium-dependent protein kinase 28
Chr07:4933756_4962623 Sa940v51022373m AT2G17930.1 Phosphatidylinositol 3- and 4-kinase family protein with FAT domain-containing protein
Chr07:4970230_4970625 Sa940v51022374m
Chr07:4978885_4985505 Sa940v51022375m AT4G10890.2 DDE family endonuclease
Chr07:4999765_5001204 Sa940v51022376m AT5G65590.1 Dof-type zinc finger DNA-binding family protein
Four input files needed (All the same expect as step3 except instead of SDR file, it need FST file):
- FST file in the format of a single line of "CHROMOSOME POSITION".
eg. Snig_Fst_test.txt <-- generated by SDR mapping pipeline
Chr07 4885391
Chr07 4898856
Chr07 4953756
Chr07 4970230
Chr07 4985505
Chr01 8313
Below the same as step 3:
2. GFF file of the target genome. <-- download from genome annotation
3. Ortholog file in the format of for each line the first two columns are "Target_Genome_Gene_Name Arabidopsis_Ortholog_Gene_Name". <-- generated by step 1
4. Arabidopsis gene annotation in the format of each line start with the gene name followed by description. <-- generated by step 2
To run the code:
perl extract_gene_Fst_anno_willow.pl Snig_Fst_test.txt
Output file "Willow_Fst_gene_anno_output.txt" contains the following:
Locus Gene At_ortholog Annotation Fst
Chr01:8313_12196 Sa940v51000005m AT2G30942.1 transmembrane protein%2C putative (DUF3317) 8313
Chr07:4885351_4889304 Sa940v51022368m AT4G36050.2 endonuclease/exonuclease/phosphatase family protein 4885391
Chr07:4898456_4905398 Sa940v51022369m AT5G66210.1 calcium-dependent protein kinase 28 4898856
Chr07:4933756_4962623 Sa940v51022373m AT2G17930.1 Phosphatidylinositol 3- and 4-kinase family protein with FAT domain-containing protein 4953756
Chr07:4970230_4970625 Sa940v51022374m NA NA 4970230
Chr07:4978885_4985505 Sa940v51022375m AT4G10890.2 DDE family endonuclease 4985505