1. Do you think that in a phylogenetic tree the parasites that use similar hosts will group together?
I think it is a reasonable hypothesis that parasites with similar hosts would be closer together in a phylogenetic tree. Parasites use the the cell machinery of the hosts and the machinery of these cells will be more similar with more similar hosts.
Theoretically bird scaffolds should have higher GC content, but here small ones will remain because there is a higher chance for random GC content if you sequence is smaller, the longer it is the harder it would be.
Species | Host | Genome size | Genes | Genomic GC |
---|---|---|---|---|
1 Plasmodium berghei | rodents | 20493455 | 4050 | .28541442281221172004 |
2 Plasmodium cynomolgi | macaques | 17954629 | 7282 | .25074910760397063387 |
3 Plasmodium falciparum | humans | 26181343 | 5787 | .42160775686953995001 |
4 Plasmodium knowlesi | lemures | 23270305 | 5207 | .23901099876307077281 |
5 Plasmodium vivax | humans | 23462346 | 4953 | .40271397184768355577 |
6 Plasmodium yoelii | rodents | 27007701 | 5682 | .46163055324301189190 |
7 Haemoproteus tartakovskyi | birds | 22222369 | (edited) 3910 | .23770659760824166960 |
8 Toxoplasma gondii | humans | 128105889 | 15892 | .58519328143783770325 |
#for loop through each file, genome size
for file in *.genome; do name=$(echo ${file%.genome});count=$(cat $file | grep -v \>| tr -d '\n' | wc -m); echo $name $count; done
Haemoproteus_tartakovskyi 20493455
Plasmodium_berghei 17954629
Plasmodium_cynomolgi 26181343
Plasmodium_faciparum 23270305
Plasmodium_knowlesi 23462346
Plasmodium_vivax 27007701
Plasmodium_yoelii 22222369
Toxoplasma_gondii 128105889
#for loop through each file, gene
for file in *.fna; do name=$(echo ${file%.fna});count=$(cat $file | grep \>| wc -l); echo $name $count; done
Ht 4050
Pb 7282
Pc 5787
Pf 5207
Pk 4953
Pv 5682
Py 4919
Tg 15892
#for loop through gc content
for file in *.fna; do name=$(echo ${file%.fna});count=$( echo $(cat $file | grep -v \> |tr -cd "CGcg" |wc -m)/$(cat $file | grep -v \> |tr -d '\n' | wc -m) |bc -l); echo $name $count; done
Ht .28541442281221172004
Pb .25074910760397063387
Pc .42160775686953995001
Pf .23901099876307077281
Pk .40271397184768355577
Pv .46163055324301189190
Py .23770659760824166960
Tg .58519328143783770325
4. Compare the genome sizes with other eukaryotes and bacteria. Discuss with your partner (that is student partner) the reason for the observed genome sizes.
For bacteria the average genome size is 5Mb reference. 10Mb to >100 000Mb is average genome size for eukaryotes reference. Our average genome size is ~20Mb so small for a eukaryote. One possible explanation for the smal genome is that plasmodium are parasites, which can use host cell proteins/machinery for its own purposes.
One reason that there is biased GC-contents is that perhaps there was horizontal gene transfer. Perhaps there is bias gene expression in the organism.
In bash it creates a list.
7. Compare how many BUSCOs (orthologues proteins) that arefound in each proteome. Do the investigated parasites have close to complete numbers of BUSCOs?
(CHECK)
#for loop for each species
for file in *.faa; do count=$(cat ${file%.faa}/run_apicomplexa_odb10/full_table.tsv | grep -v '#'| grep -v "Missing"| grep -v "Fragmented" | cut -f 1 | sort -u | wc -l); echo ${file%.faa} $count ;done
# Ht 326
# Pb 372
# Pc 429
# Pf 436
# Pk 323
# Pv 437
# Py 434
# Tg 384
We have the fewest BUSCO genes in Pk(323) and the most in Pv(437). The complete number of BUSCO genes is 446. I do not know the threshold for whether something is close to complete but 437/446 is pretty good and is 326/446 is not bad.
#now to see what percent are complete/ duplicate
for file in *.faa; do count=$(echo $(cat ${file%.faa}/run_apicomplexa_odb10/full_table.tsv |grep -v "#" | grep -v "Missing"| grep -v "Fragmented" | cut -f 1 | sort -u | wc -l)/ $(cat ${file%.faa}/run_apicomplexa_odb10/full_table.tsv | grep -v '#' | cut -f 1 | sort -u | wc -l)|bc -l); echo ${file%.faa} $count ;done
# Ht .73094170403587443946
# Pb .83408071748878923766
# Pc .96188340807174887892
# Pf .97757847533632286995
# Pk .72421524663677130044
# Pv .97982062780269058295
# Py .97309417040358744394
# Tg .86098654708520179372
Here we see that percentage wise, the fewest BUSCO genes were found in Pk(0.72) and the most Pv(0.98). Again, unsure of threshhold, but does not seem bad.
8. Do you think that the assembly of the Haemoproteus takowskyi genome is a reasonable approximation of the true genome?
(CHECK) I do not think its too bad, however we do have to keep in mind that only one other species had fewer BUSCO genes found.
(CHECK)
#make one concatonated file
for file in *.faa; do cat ${file%.faa}/run_apicomplexa_odb10/full_table.tsv | grep -v "#" | grep -v "Missing"| grep -v "Fragmented" | cut -f 1 | sort -u >>concat_full_table.tsv ;done
#find BUSCO genes found in all 8
cat concat_full_table.tsv | sort | uniq -c| cut -d " " -f 7 | grep 8 | wc -l
# 185
#now for 7 species
#one contatonated file
ls *.faa | grep -v Tg.faa | while read file; do cat ${file%.faa}/run_apicomplexa_odb10/full_table.tsv | grep -v "#" | grep -v "Missing"| grep -v "Fragmented" | cut -f 1 | sort -u >>7org_full_table.tsv ;done
185 BUSCOs are found in all eight organisms.
10. If Toxoplasma is removed, how many BUSCOs are shared among the remaining seven species. Interpret!
#now for 7 species
#one contatonated file
ls *.faa | grep -v Tg.faa | while read file; do cat ${file%.faa}/run_apicomplexa_odb10/full_table.tsv | grep -v "#" | grep -v "Missing"| grep -v "Fragmented" | cut -f 1 | sort -u >>7org_full_table.tsv ;done
#find BUSCO found in all 7
cat 7org_full_table.tsv | sort | uniq -c| cut -d " " -f 7 | grep 7 | wc -l
204
204 BUSCOS are found in all seven organisms. If Taxoplasma is our outgroup, it makes sense that more BUSCO genes were found within the ingroup.
As seen by the tree above, it is the same as 'true' species tree.
Plasmodium falciparum is the most distantly related species of plasmodium, and the closest related plasmodium to Haemoproteus.