GitHub - guanqiaofeng/Gene_Annotation: Willow and poplar SDR (sex determination region) gene identification and annotation based on Arabidopsis orthologs

Willow/poplar SDR gene identification and annotation based on Arabidopsis orthologs

This pipeline contains 4 steps. Step 1 and 2 only need to be done once for each genome; and you may get these results from other rescources. I will provide you all files needed for step 3 (identify SDR genes) and 4 (identify genes with high Fst). So you can jumpt directly to step 3.

1. Ortholog identification

Identify target genome's (S. purpurea pacbio annotaion v5.1/P. trichocharpa pacbio annotation v1.1) best bastp hit in Arabidopsis database (Araport11 annotation). This is a "quick-and-dirty" way to identify "ortholog". But in our case I think it works for the purpose.

$ makeblastdb -in Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa -dbtype prot -out Araport11.db

$ blastp -db Araport11.db -query Spurpurea94006v5.1.primaryTrs.pep.fa -out Spur_v5.1_to_Ara11.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1

$ blastp -db Araport11.db -query Ptrichocarpavar145v1.1b.primaryTrs.pep.fa -out Ptri_v1.1_to_Ara11.txt -evalue 1e-5 -outfmt 6 -max_target_seqs 1

2. Find Arabidopsis Annotation

I find these information from "Araport11_genes.201606.pep.fasta" from Tair website. Two things needs to be done: 1) extract the primary transcripts; 2) extract their annotation. There might be better ways to find these information. You may find a file contain these information already. Then you can skip this step.

perl extract_At_anno.pl

Code of the extract_At_anno.pl

#!/usr/bin/perl

%hash = ();
@seqs = ();

open FH, "<Athaliana_447_Araport11.protein_primaryTranscriptOnly.fa";
while (<FH>)
{
    if (/^>(\S+)/)
    {
        $name = $1;
        $hash{$name} = 0;
    }
}
close FH;

open FH, "<Araport11_genes.201606.pep.fasta";
open OUT, ">Araport11_gene_anno.txt";
while (<FH>)
{
    $seq = $_;
    chomp $seq;
    if (/^>(\S+)/)
    {
        $gene = $1;
        if (exists ($hash{$gene}))
        {
            @seqs = split(/\|/, $seq);
            $seqs[0] =~ s/>//;
            print OUT "$seqs[0]\t$seqs[1]\n";
        }
    }
}
close FH;
close OUT;

3. Identify SDR genes (S. nigra example)

Four input files needed:

SDR file in the format of a single line of "CHROMOSOME START_POSITION END_POSITION".
eg. Snig_SDR.txt <-- generated by SDR mapping pipeline

Chr07	4880000	6860000

GFF file of the target genome.
eg. Spurpurea94006v5.1.gene.gff3 <-- download from genome annotation

##gff-version 3
##annot-version v5.1
##species Salix purpurea 94006
Chr01	JGI	gene	8313	12196	.	+	.	ID=Sa940v51000005m.g;Name=Sa940v51000005m.g
Chr01	JGI	mRNA	8328	10828	.	+	.	ID=PAC4GC:37960894;Name=Sa940v51000005m;longest=1;Parent=Sa940v51000005m.g
Chr01	JGI	five_prime_UTR	8328	8481	.	+	.	ID=PAC4GC:37960894.five_prime_UTR.1;Parent=PAC4GC:37960894
Chr01	JGI	CDS	8482	8511	.	+	0	ID=PAC4GC:37960894.CDS.1;Parent=PAC4GC:37960894
Chr01	JGI	CDS	8705	8771	.	+	0	ID=PAC4GC:37960894.CDS.2;Parent=PAC4GC:37960894
Chr01	JGI	CDS	9974	10069	.	+	2	ID=PAC4GC:37960894.CDS.3;Parent=PAC4GC:37960894
Chr01	JGI	CDS	10164	10279	.	+	2	ID=PAC4GC:37960894.CDS.4;Parent=PAC4GC:37960894

Ortholog file in the format of for each line the first two columns are "Target_Genome_Gene_Name Arabidopsis_Ortholog_Gene_Name".
eg. Spur_b5.1_to_Ara11.txt <-- generated by step 1

Sa940v51059915m	AT2G17930.1	84.934	677	102	0	1	677	1341	2017	0.0	1152
Sa940v51061191m	AT3G24880.1	68.182	110	35	0	3	112	535	644	5.62e-49	169
Sa940v51061190m	AT3G24880.1	58.515	229	94	1	1	229	712	939	2.74e-81	265
Sa940v51061192m	AT3G60860.1	77.802	464	103	0	15	478	398	861	0.0	751
Sa940v51061189m	AT3G24870.1	64.474	76	27	0	2	77	990	1065	5.95e-26	99.4
Sa940v51061138m	AT5G19820.1	47.619	84	44	0	22	105	988	1071	5.18e-22	89.7

Arabidopsis gene annotation in the format of each line start with the gene name followed by description.
eg. Araport11_gene_anno.txt <-- generated by step 2

AT1G01010.1 	 NAC domain containing protein 1 
AT1G01020.1 	 ARV1 family protein 
AT1G01030.1 	 AP2/B3-like transcriptional factor family protein 
AT1G01040.2 	 dicer-like 1 
AT1G01050.1 	 pyrophosphorylase 1

To run the job:

perl extract_gene_SDG_anno_willow.pl Snig_SDR.txt

Output file "Willow_SDR_gene_anno_output.txt" contains the following:

Locus	Gene	At_ortholog	Annotation
Chr07:4885351_4889304	Sa940v51022368m	AT4G36050.2 	 endonuclease/exonuclease/phosphatase family protein 
Chr07:4898456_4905398	Sa940v51022369m	AT5G66210.1 	 calcium-dependent protein kinase 28 
Chr07:4933756_4962623	Sa940v51022373m	AT2G17930.1 	 Phosphatidylinositol 3- and 4-kinase family protein with FAT domain-containing protein 
Chr07:4970230_4970625	Sa940v51022374m	
Chr07:4978885_4985505	Sa940v51022375m	AT4G10890.2 	 DDE family endonuclease 
Chr07:4999765_5001204	Sa940v51022376m	AT5G65590.1 	 Dof-type zinc finger DNA-binding family protein

4. High Fst located gene (S. nigra example)

Four input files needed (All the same expect as step3 except instead of SDR file, it need FST file):

FST file in the format of a single line of "CHROMOSOME POSITION".
eg. Snig_Fst_test.txt <-- generated by SDR mapping pipeline

Chr07	4885391
Chr07	4898856
Chr07	4953756
Chr07	4970230
Chr07	4985505
Chr01	8313

Below the same as step 3: 2. GFF file of the target genome. <-- download from genome annotation
3. Ortholog file in the format of for each line the first two columns are "Target_Genome_Gene_Name Arabidopsis_Ortholog_Gene_Name". <-- generated by step 1
4. Arabidopsis gene annotation in the format of each line start with the gene name followed by description. <-- generated by step 2

To run the code:

perl extract_gene_Fst_anno_willow.pl Snig_Fst_test.txt

Output file "Willow_Fst_gene_anno_output.txt" contains the following:

Locus	Gene	At_ortholog	Annotation	Fst
Chr01:8313_12196	Sa940v51000005m	AT2G30942.1 	 transmembrane protein%2C putative (DUF3317) 	8313
Chr07:4885351_4889304	Sa940v51022368m	AT4G36050.2 	 endonuclease/exonuclease/phosphatase family protein 	4885391
Chr07:4898456_4905398	Sa940v51022369m	AT5G66210.1 	 calcium-dependent protein kinase 28 	4898856
Chr07:4933756_4962623	Sa940v51022373m	AT2G17930.1 	 Phosphatidylinositol 3- and 4-kinase family protein with FAT domain-containing protein 	4953756
Chr07:4970230_4970625	Sa940v51022374m	NA	NA	4970230
Chr07:4978885_4985505	Sa940v51022375m	AT4G10890.2 	 DDE family endonuclease 	4985505

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
poplar		poplar
willow		willow
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Willow/poplar SDR gene identification and annotation based on Arabidopsis orthologs

1. Ortholog identification

2. Find Arabidopsis Annotation

3. Identify SDR genes (S. nigra example)

4. High Fst located gene (S. nigra example)

About

Releases

Packages

Languages

guanqiaofeng/Gene_Annotation

Folders and files

Latest commit

History

Repository files navigation

Willow/poplar SDR gene identification and annotation based on Arabidopsis orthologs

1. Ortholog identification

2. Find Arabidopsis Annotation

3. Identify SDR genes (S. nigra example)

4. High Fst located gene (S. nigra example)

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages