We have developed two implementations of GCE. The first, gce-1.0.2, is
recommended for most users; the second, gce-alternative, can be used for
comparison and for developing new algorithms. Both take the kmerfreq output
as the input file, with one small difference: for gce-1.0.2 the header lines
must be removed so that only the data lines remain, while gce-alternative does not need that.
1. gce-1.0.2
GCE (genomic character estimator) is a Bayesian-model-based method for estimating the
genome size, genomic repeat content, and heterozygosity rate of the sequencing
sample. The estimated results can be used to design the sequencing strategy.
GCE is primarily hosted on BGI's FTP site (ftp://ftp.genomics.org.cn/pub/gce).
The latest version, gce-1.0.2, is also available on GitHub (https://github.com/fanagislab/GCE).
Note that gce-1.0.2 is compatible with the latest kmerfreq version 4.0 (max
depth 65535), which is available on GitHub (https://github.com/fanagislab/kmerfreq).
INSTALLATION
Download the package and run
tar -xzvf gce.tar.gz
make (to build the executable file "gce")
If you downloaded the compiled version, you can use gce directly.
USAGE
gce -f test.freq -g total_kmer_num
Options:
-f depth frequency file: a list file containing at least two columns. The first column
    is the depth and the second column is the frequency (not the ratio) of that depth; other
    columns are ignored by the program.
-g total k-mer number counted from the reads. Setting this value is suggested
    for accurate estimation. If it is not set, the total k-mer number is calculated from the data in
    the depth frequency file (-f), which is often missing data and can cause errors in the estimation.
-c unique coverage depth. Setting it is suggested when there is no
    clear peak or when there are clear non-unique peaks, especially when the
    heterozygosity rate is high.
-H hybrid mode. Using it is suggested when the peak caused by heterozygosity
    is clear.
-b set this value when there is sequencing bias.
-m estimation mode: the standard discrete model (default) or the continuous model. Set it to 1
    to use the continuous model, which is less stable.
-M max depth value; information for larger depths is ignored. Increasing this value
    gives higher estimation accuracy but slower run speed.
-D set the raw distance for the continuous model, which determines the number
    of peaks.
-h display help information.
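As a rough cross-check on the value passed to "-g", the total k-mer number implied by a depth frequency file is the sum of depth * frequency over all rows. The sketch below assumes a whitespace-separated file with the depth in column 1 and the frequency in column 2. The function name sum_kmer_individuals is illustrative and not part of GCE, and whether GCE computes its fallback value exactly this way is an assumption.

```shell
# Hypothetical helper (not a GCE command): sum depth * frequency over a
# two-column depth frequency file. Lines starting with "#" are skipped.
sum_kmer_individuals() {
    awk '!/^#/ && NF >= 2 { total += $1 * $2 }
         END { printf "%.0f\n", total }' "$1"
}

# Example usage (the file name is illustrative):
# sum_kmer_individuals AF.kmer.freq.stat.2colum
```

If this sum is clearly smaller than the total reported by the k-mer counter, the file is probably missing data, which is the case where setting "-g" explicitly matters most.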
Run examples:
First, use a k-mer counting tool to calculate the k-mer frequency of the sequencing data; the result file is AF.kmer.freq.stat
kmerfreq -k 17 -t 10 -p AF ./raw_reads.lib
Then get the total k-mer number for the gce option "-g", and the depth frequency file for the gce option "-f":
less AF.kmer.freq.stat | grep "#Kmer indivdual number"
less AF.kmer.freq.stat | perl -ne 'next if(/^#/ || /^\s/); print; ' | awk '{print $1"\t"$2}' > AF.kmer.freq.stat.2colum
For a genome with a lower heterozygosity rate
./gce -g 173854609857 -f AF.kmer.freq.stat.2colum >gce.table 2>gce.log
For a genome with a higher heterozygosity rate
./gce -g 173854609857 -f AF.kmer.freq.stat.2colum -c 75 -H 1 >gce2.table 2>gce2.log
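Before running gce, it can help to confirm that the two-column file really contains only data lines. The sketch below is a minimal check, assuming that both the depth and the frequency are non-negative integers; check_freq_file is an illustrative name, not a GCE command.

```shell
# Hypothetical helper (not part of GCE): succeed only if every line has at
# least two fields and the first two fields are non-negative integers.
check_freq_file() {
    awk 'BEGIN { bad = 0 }
         NF < 2 || $1 !~ /^[0-9]+$/ || $2 !~ /^[0-9]+$/ {
             bad = 1
             print "unexpected line " NR ": " $0 > "/dev/stderr"
         }
         END { exit bad }' "$1"
}

# Example usage (the file name is taken from the commands above):
# check_freq_file AF.kmer.freq.stat.2colum && echo "file looks fine"
```

A leftover header line or an empty line makes the check fail and is reported on standard error with its line number.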
OUTPUT
GCE generates two output files: gce.table and gce.log
The most valuable estimation results can be found at the end of the gce.log file:
Final estimation table:
raw_peak effective_kmer_species effective_kmer_individuals coverage_depth genome_size a[1] b[1]
75 742400596 168346645871 75.8021 2.22087e+09 0.663012 0.271515
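The numbers in this table are consistent with the usual k-mer relationship: genome_size is approximately effective_kmer_individuals divided by coverage_depth. The sketch below only re-derives the genome_size column from the two other columns as a sanity check. It does not describe GCE's internal algorithm, and estimate_genome_size is an illustrative name.

```shell
# Hypothetical helper (not part of GCE): k-mer individuals / coverage depth,
# printed in the same scientific notation as the gce.log table.
estimate_genome_size() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.5e\n", n / d }'
}

# Values copied from the example table above.
estimate_genome_size 168346645871 75.8021
```

With the example values this prints 2.22087e+09, matching the genome_size column.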
2. gce-alternative
Function
This package, developed by Wei Fan ([email protected]), is an alternative implementation of Binghang Liu's GCE software (ftp://ftp.genomics.org.cn/pub/gce).
Installation
Except for the two programs written in C++, which need "make" to compile, the others are Perl programs.
Input and output
The output file from kmerfreq can be used as the input file for all the programs here.
Usage
a. Estimate genome size only, with erroneous k-mers excluded and a floating-point estimate of the peak coverage value
perl ../estimate_genome_size.pl reads.freq.stat
b. Estimate genome size as well as repeat content and heterozygosity, using the discrete model; suitable for theoretical data and good sequencing data without coverage bias
perl ../estimate_genome_character.pl ./reads.freq.stat
c. Estimate genome size as well as repeat content and heterozygosity, using the continuous model; suitable for poor or typical real sequencing data with severe coverage bias
perl ../estimate_genome_character_real.pl ./reads.freq.stat
3. Reference
Binghang Liu, Yujian Shi, Jianying Yuan, et al. and Wei Fan*. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012 (2013).
https://arxiv.org/abs/1308.2012
4. Help
http://blog.sciencenet.cn/blog-3406804-1162384.html
http://blog.sciencenet.cn/blog-3406804-1161524.html
5. Contact
Please send an e-mail to [email protected] and [email protected].