This is wcd.info, produced by makeinfo version 4.8 from wcd.texi.
File: wcd.info, Node: Top, Next: Description, Up: (dir)
The `wcd' tools (pronounced 'wicked') do EST clustering. The
most recent version is called `wcd-express'.
This is the `wcd-express' manual.
Copyright (C) 2003-2010 Scott Hazelhurst, University of the
Witwatersrand, Johannesburg. This documentation is distributed under
the GNU Free Documentation Licence.
* Menu:
* Description:: What EST clustering is
* Installing wcd:: How to install it
* Running wcd:: How to get it to work
* wcd inside stackPACK:: How to incorporate wcd into the stackPACK clustering system
* Technical manual:: The gory details
* Testing:: How tested
* Acknowledgements and copyright:: Thanks and copyright information
* Concept Index:: Some index terms
File: wcd.info, Node: Description, Next: Installing wcd, Prev: Top, Up: Top
1 Description
*************
This manual describes `wcd' (pronounced `wicked'), a program that
performs clustering of sequences. Essentially, it takes a set of DNA
sequences (typically Expressed Sequence Tags, ESTs), and clusters the
sequences into related groups.
It reads in a set of DNA sequences (typically ESTs or mRNAs) in FASTA
format and organises them into clusters of similar sequences, producing
a list of clusters and their members on standard output.
The clustering performed is single-linkage clustering. Sequences are
put in the same cluster if they have regions of close similarity.
Similarity is determined using one of the standard distance functions.
In versions > 0.2.0 of `wcd', three distance functions are supported:
the d2 function, edit distance and a common word heuristic. This may be
expanded in later versions of `wcd'.
`wcd' has been used mainly on ESTs but also on genomic sequences
(particularly metagenomic sets). The largest sequence sets known to be
clustered are: 6.6M 454 DNA sequences (2.55G bases), which took 35 hours
27 minutes using 32 Irwindale (3.6GHz) Intel Xeons (1 MB of L2 cache);
and 18M 454 DNA sequences (just under 8G bases), which took just under
10 days on the same cluster; the 18M set took 84 hours on a cluster of
128 AMD 2218 2.6GHz machines.
When using the d2 distance function, `wcd''s results are similar to
those generated by the `d2_cluster' program, which also implements
clustering based on the d2 distance function.
The use of the d2 distance function in DNA sequence comparison is
described in Hide et al. (1994) and Burke et al. (1999), and has been
shown to be sensitive to sequence similarity even in the absence of
direct sequence alignments. This corresponds with the nature of EST data
which typically contains single-base read errors and may contain
sequence from different splice forms of the same transcript (leading to
local dissimilarity).
Thanks to the properties of the d2 distance function, sequence
clusters created by `wcd' correspond closely to gene classes, i.e. all
the ESTs originating from a specific gene will be clustered into a
single class.
`wcd' is written in C and has been successfully installed and tested
on Linux, FreeBSD, SGI Irix, Sun Solaris, Windows 2000 (under Cygwin)
and Windows XP.
Users of `wcd' are welcome to contact the developer.
The remainder of this manual assumes that a user understands the basic
principles of what EST clustering is about, and so the basic concepts
are only briefly reviewed in this chapter. Subsequent chapters discuss
installing and running `wcd', as well as giving some technical
background for someone who wants to use some of the code.
1.1 Citing `wcd'
================
If you cite `wcd' please use the following reference (this list may be
updated):
* S. Hazelhurst, W. Hide, Z. Liptak, R. Nogueira, R. Starfield.
An overview of the `wcd' EST clustering tool.
_Bioinformatics_ 24(13), July 2008, pp. 1542-1546. doi:
`10.1093/bioinformatics/btn203'.
A description of the underlying algorithms can be found in
* S. Hazelhurst. Algorithms for clustering EST sequences: the
`wcd' tool. _South African Computer Journal_, _40_, June
2008.
A paper describing the KABOOM algorithm has been accepted by
Bioinformatics
* S. Hazelhurst and Z. Liptak. KABOOM! A new suffix-array based
algorithm for clustering expression data. _Bioinformatics_
(Accepted for publication)
1.2 Clustering
==============
The basic idea is that we take a set of sequences and cluster them so
that sequences that are close to each other end up in the same cluster. There
are many clustering algorithms, and one review can be found in
@article{cluster2,
author = "A. K. Jain and M. N. Murty and P. J. Flynn",
title = "Data clustering: a review.",
journal = "ACM Comp. Surveys",
volume = "31",
number = "3",
pages = "264--323",
month = sep,
year = "1999"
}
The type of clustering we do is single linkage clustering. We create
the largest number of clusters (and hence the smallest clusters) such that
if two sequences are very close they are put in the same cluster. Note
the definition is transitive: if A and B are close, and B and C are
close then A and C will be in the same cluster even if A and C are far
apart.
The basic clustering algorithm is shown below (obviously, there are
performance enhancements).
foreach sequence i
    put i in its own cluster
foreach sequence i
    foreach sequence j
        if i and j are close to each other
            merge the clusters to which i and j belong
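The loop above can be sketched concretely with a union-find structure. The Python sketch below is illustrative only (`wcd' itself is written in C and adds many performance enhancements); the close() predicate is a stand-in for whatever distance test is in use.

```python
# Sketch of single-linkage clustering with union-find (illustrative only).

def single_linkage(n, close):
    """Cluster items 0..n-1; close(i, j) says whether i and j are similar."""
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the cluster representative, halving paths.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if close(i, j):
                parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```

Note how transitivity emerges: if close(0,1) and close(1,2) hold but close(0,2) does not, items 0 and 2 still end up in the same cluster.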
1.3 Distance functions
======================
A distance function takes two sequences and returns a number that says
how far apart the sequences are mathematically. Many distance functions
have been proposed with different biological and computational
motivations: hopefully a well-chosen distance function will return a
mathematical value that has some biological meaning.
One of the purposes of `wcd' is to allow different distance
functions to be used. The distance functions which `wcd' supports are:
* d2
* edit distance
* a common word heuristic
1.3.1 The D2 distance function
------------------------------
The d2 distance function is based on word counts.
* Let _x_ be a sequence and _w_ a word.
* We use the notation x' \in x to mean that x' is a _window_
or substring of _x_ of some specified length.
* c_x(w) is the number of times that _w_ appears in _x_.
1.3.1.1 General formulation
...........................
To measure the distance between two sequences we compute the word
frequencies and then take the sum of the squares of the differences.
For a fixed _k_, we examine all the words _w_ of length _k_,
compute the frequencies with which they occur in the strings _x_ and _y_,
and then sum the squares of the differences. Formally,
d^2_k(x,y) = \sum_{|w|=k}(c_x(w)-c_y(w))^2
Then in general
d^2(x,y) = \sum_{k=l}^{L} d^2_k(x,y)
for constants _l_ and _L_. In practice (and this is what `wcd' does),
we compute the d2 score for a fixed _k_. The default value of _k_ is
6, but this is a parameter of the program.
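As a concrete (and deliberately naive) illustration of the fixed-_k_ computation above, the score can be obtained from two word-count tables; this is a sketch, not `wcd''s optimised C implementation:

```python
from collections import Counter

def d2k(x, y, k=6):
    """d^2_k(x, y): sum of squared differences of k-word counts."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Iterate over every word seen in either sequence.
    return sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys())
```

Identical sequences score 0; for example, d2k("AAAA", "AAAT", k=2) compares counts {AA:3} against {AA:2, AT:1}, giving (3-2)^2 + (0-1)^2 = 2.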
1.3.1.2 Windows
...............
The goal of EST clustering is to see whether there is an overlap between
two sequences: thus we are not interested in the d2 score between the
whole of one sequence and the whole of another, but in whether there are
two regions (called windows) that have a low d2 score.
Thus the d2 score between two sequences is the minimum of the d2
score over all pairs of windows of those two sequences:
d^2_k(x,y) = \min_{x' \in x, y' \in y} \sum_{|w|=k}(c_{x'}(w)-c_{y'}(w))^2
The consequence of this is that d2 applied like this does not obey
the triangle inequality. It is very common that
d^2(x,z) > d^2(x,y)+d^2(y,z)
since there is a window that _x_ shares with _y_, another that _y_
shares with _z_ but no window common to both _x_ and _z_.
As an aside, d2 even when applied to sequences is not a metric since
d^2(x,y)=0 does not necessarily imply that x=y.
We don't compare two sequences directly. Rather we compare all
windows of fixed length in one sequence with all the windows in the
other. The default window length is 100 (again it's parameterisable).
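A direct (quadratic, unoptimised) sketch of this windowed comparison follows; `wcd''s real implementation updates word counts incrementally as the windows slide rather than recounting each pair from scratch:

```python
from collections import Counter

def d2k(x, y, k):
    """Sum of squared differences of k-word counts."""
    cx = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    cy = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    return sum((cx[w] - cy[w]) ** 2 for w in cx.keys() | cy.keys())

def windowed_d2(x, y, k=6, window=100):
    """Minimum d^2_k over all pairs of fixed-length windows of x and y.

    Sequences shorter than the window are compared whole (a
    simplification here, not a statement about wcd's scaling rules).
    """
    xs = [x[i:i + window] for i in range(len(x) - window + 1)] or [x]
    ys = [y[i:i + window] for i in range(len(y) - window + 1)] or [y]
    return min(d2k(xp, yp, k) for xp in xs for yp in ys)
```

This also makes the failure of the triangle inequality easy to see: _x_ may share a window with _y_, and _y_ another with _z_, while no window is common to _x_ and _z_.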
1.3.1.3 References
..................
The paper by Burke et al describes the basic d2 method and justifies
its use.
@article{burke99,
author = "J. Burke and D. Davison and W. Hide",
title = "{D2\_cluster: A Validated Method for Clustering EST and
Full-length cDNA Sequences}",
journal = "{Genome Research}",
year = 1999,
volume = 9,
number = 11,
pages = "1135--1142"
}
The SACJ paper (hazel2008b) describes the algorithms in wcd in some
detail. The Bioinformatics paper describes the use of wcd and gives
comparative performance (the SACJ paper also concentrates on an earlier
version of wcd). If you use wcd, please cite one of these papers.
@article{hazel2008a,
journal = "Bioinformatics",
year = 2008,
note = "doi:10.1093/bioinformatics/btn203",
title = "An overview of the wcd {EST} Clustering Tool",
author = "S. Hazelhurst and W. Hide and Z. Lipt\'ak and R. Nogueira and R Starfield",
month = jul,
volume = 24,
number = 13,
pages = "1542-1546",
doi = "10.1093/bioinformatics/btn203",
}
@article{hazel2008b,
journal = "South African Computer Journal",
year = 2008,
title = "Algorithms for clustering {EST} sequences: the wcd tool",
month = jun,
volume = 40,
author = "S. Hazelhurst",
pages = "51-62",
}
1.3.2 Edit distance
-------------------
Edit distance is well known; the reference below gives a very thorough
introduction to the subject.
The version of edit distance implemented in `wcd' does local
alignment (Smith-Waterman). Penalties (positive numbers) are given for
mismatches, deletions or gaps, and a reward (a negative number) is
given for a match. The default values are:
* -1 for a match
* +3 for mismatch
* +5 for opening a gap (indel)
* +2 for extending a gap
The user can change these values.
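A minimal sketch of this cost-minimising local alignment follows. It makes two simplifying assumptions that are NOT taken from `wcd''s source: gaps are charged a single linear penalty (the affine open/extend pair is collapsed to the opening cost of 5), and the best local score is the most negative DP cell, with every cell clamped at zero so an alignment may restart anywhere.

```python
def local_align_score(x, y, match=-1, mismatch=3, gap=5):
    """Most negative local-alignment score under a cost-minimising
    scheme (simplified linear-gap reading of the defaults above)."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            score = min(0,                  # clamp: alignment may restart here
                        prev[j - 1] + sub,  # match/mismatch on the diagonal
                        prev[j] + gap,      # gap in y
                        cur[j - 1] + gap)   # gap in x
            cur.append(score)
            best = min(best, score)
        prev = cur
    return best
```

For example, aligning "ACGT" against itself scores -4 (four matches at -1 each), while two sequences with no similar region stay clamped at 0.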
@Book{ gusfield97,
author = "D. Gusfield",
title = "Algorithms on strings, trees, and sequences",
address = "Cambridge, United Kingdom",
publisher = "Cambridge University Press",
year = 1997
}
File: wcd.info, Node: Installing wcd, Next: Running wcd, Prev: Description, Up: Top
2 Installing `wcd'
******************
2.1 Getting new versions of `wcd'
=================================
The latest version of `wcd' can be found at:
`http://code.google.com/p/wcdest'
You should also be able to find it at:
`ftp://ftp.cs.wits.ac.za/pub/research/software/'
Please let me know if you would like to be told about new versions of
`wcd'.
2.2 Changes log
===============
2.2.1 Version 0.6.3
-------------------
Mainly internal changes - better memory behaviour and some bug fixes.
However, there are two changes in the way in which `wcd' is called.
Usually, I don't like doing this because it may make scripts break but
the alternatives here are worse. Sorry!
* The `-y' or `--mirror' option no longer takes an argument. In
0.6.1 it took the name of a file. Now, the output goes to standard
output and the `-o' option should be used.
A consequential change is that the `prep-kabm' and `dustprep-kabm'
scripts have changed and should be replaced.
* The `kl-seed' or `-$' option has changed so that it does take an
argument: _s_ for symmetric mode or _a_ for asymmetric mode. See
the documentation.
2.2.2 Version 0.6 `wcd-express'
-------------------------------
There is one very big change in version 0.6: the KABOOM algorithm for
speeding up clustering. Note that for data files that are greater than
1GB in size, a 64-bit machine is needed when using KABOOM.
Version 0.6 - `wcd-express' - has been tested on Linux (Scientific
Linux and Ubuntu) and on MacOS X (10.5 and 10.6).
There is a binary version for Windows. However, this does not at
this point support the KABOOM heuristic. The problem is some
incompatible libraries. This is not insurmountable but will take a
couple of days we don't have at the moment (if you are prepared to do
the schlep, the problem is that the mman library is not ported, so we
would have to use the Windows memory-mapped file interface instead). We
have configured and compiled the code on Windows and you can too, but
the current conditional configuration will mean that KABOOM is not
supported.
2.2.2.1 Miscellaneous changes.
..............................
* Support for short read sequences such as 454 and Illumina
sequences. Mainly this isn't so much new things or changes in wcd,
but some experimentation so that we know which parameters to use.
Our experimentation has yielded good results.
* Support for shorter sequences, particularly for files where there
is variability in the sequence length. In previous versions of
`wcd' any sequence below the specified window length would be
thrown away. In this version of `wcd', sequences shorter than the
window length will be clustered, but all the distance and
heuristic thresholds are scaled for that sequence so that a higher
level of similarity is required.
* Support for the type of input expected by Bragg and Stone's k-link
tool
`http://bioinformatics.oxfordjournals.org/content/25/18/2302.short'
has been added. `wcd' can be used as a pre-processor for K-link.
* An option where `wcd' will only cluster sequences if the regions of
similarity are at appropriate ends of the ESTs (the _ends_ option
for the `-F' option).
* The reading in of FASTA files has been made more robust. Previous
versions sometimes choked silently on some FASTA files produced in
Windows systems and then went into an endless loop.
2.2.2.2 KABOOM algorithm
........................
Version 0.6 supports the k,alpha,beta multiple-word suffix-array based
algorithm for faster clustering - known as the KABOOM algorithm. This
replaces the previous suffix-array algorithm supported in version 0.5.
First, this is an option that the user must ask for. Unless you
explicitly ask for that option you will get the standard clustering
algorithm.
The KABOOM algorithm is appropriate for clustering large files -
typically we see an order of magnitude improvement in performance.
Version 0.6.0 only supports files up to 2GB in size but this is not a
difficult limit to break. If you want to use the KABOOM algorithm on
larger files, let me know. It's on my list of things to do but there's
nothing like encouragement.
The KABOOM algorithm requires that a suffix array of the input file
and its reverse complement are available. We recommend that the `mkESA'
tool of Homann, Fleer, Giegerich and Rehmsmeier be used
`http://bioinformatics.oxfordjournals.org/content/25/8/1084.short'. This
is a flexible and powerful tool that has performed exceptionally well on
our data. However, you are welcome to use whatever tool you prefer as
long as it produces output in the same format.
To create the reverse complement, `wcd' has an auxiliary feature
(the mirror or `-y' option documented below). Alternatively, the
`revseq' program in the EMBOSS suite `http://emboss.sourceforge.net/'
could be used.
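For reference, computing a reverse complement is straightforward; this hypothetical helper is not `wcd''s `-y' implementation (in particular, mapping unrecognised bases to N is an assumption made here, not documented behaviour):

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "T": "A", "C": "G", "G": "C", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))
```

For example, reverse_complement("AAGC") gives "GCTT".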
2.2.3 Changes in 0.5
--------------------
Version 0.5 is a big change from 0.4 internally, but not externally.
Memory representation was changed a lot to make `wcd' more robust for
very large data sets run in highly-parallel mode. From a user
perspective, there are few differences. Here are some of the changes (a
few were put in 0.4.5.7 but may not have been noticed).
* NOAUXINFO. A new configure-time option `--enable-NOAUXINFO' has
been included. This is switched OFF by default. If you enable it,
`wcd' does not store any of the auxiliary sequence information like the
sequence ID. For extremely large data sets this may make a
difference in memory usage. Also, I found on some systems that doing
several million mallocs stress-tested the memory system
implementation. My personal experience is that this option is not
needed - but one person who used the system found it useful. This
was introduced in 0.4.5.7, and may be less important in 0.5 as the
memory implementation has changed.
* Run time option: `-Q 1' Index sequences from 1 rather than 0. Only
use if you know what you are doing. See the discussion below.
* Run time option `-G val@dir'. `wcd' puts the clusters found in the
given directory. See the discussion below.
* Run time option `-k splitname -f clusterfile'. Split the input
sequence file into clusters given an existing cluster table. This
option is _post hoc_: here we have the cluster table already and
are splitting according to that. The `-G' option is for when you
don't have a cluster table and must do clustering.
2.2.4 Version 0.4
-----------------
There are a number of changes from 0.3 to 0.4 in memory organisation and
algorithm. The first two are important to note if you used the options
as the appropriate default values have changed.
* The sample word length is no longer a parameter. It's fixed as a
constant at 8. This was done after extensive empirical testing.
Fixing so that a sample word fits in a machine word has numerous
computational advantages. The -B option therefore no longer
exists. The default value for -K is now 6.
* Our second heuristic, the so-called t/v heuristic, has been
replaced in favour of a common word heuristic. Though the common word
heuristic does seem better, it is more expensive to implement. The
-H or --common_word heuristic's default value is now 65.
* The SMP algorithm has been simplified. It is now possible to use
the Pthreads and MPI parallelisation together. This may be useful
when you have a cluster of multi-processor machines, but if memory
is available it is better and easier just to use MPI. So if you
have 16 machines with 4 cores each, run an MPI job with 64
processes. You should only use pthreads in place of or as well as
MPI where memory is a factor and some experimentation on your
architecture shows performance improvement.
2.3 Installing the distribution
===============================
The distribution comes as a gzipped, tarred file. The following
commands should create a directory called `wcd_distrib' with all the
necessary files.
gunzip wcd_distrib.tar.gz
tar -xf wcd_distrib.tar
2.4 Compiling
=============
There are two possible ways to compile and install `wcd'
2.4.1 Method 1: using `configure'
---------------------------------
The INSTALL file gives a more detailed explanation and description.
The following should work if you don't want anything fancy. If you do,
or if you want to ensure that some compile flags are different, then
read the INSTALL file.
./configure
make
make install
Note that the last instruction will attempt to install the executable
and documentation in the appropriate places on your system and will
probably require you to have root privilege or write permission for
those standard directories. If you don't have these, you can find the
executable `wcd' in the `src' directory and the texinfo documentation
`wcd.info' in the `doc' directory. Manually copy these to the correct
place.
If you are using FreeBSD, you may have to make a small change to the
`common.h' file.
If you want a manual you can print out, say `make pdf'. Otherwise
you can read the on-line version using the `info' system. If the
documentation is installed in the right place on your system you can
just say `info wcd', otherwise you will have to say `info -f wcd.info'.
`wcd' should work on 64-bit, little-endian and big-endian machines.
2.4.1.1 MPI Version
...................
From version 0.3.0, wcd has an MPI version that allows parallelisation
of the clustering on a cluster of workstations.
./configure --enable-MPI
make
make install
Initial experimentation has shown that using the MPI option when not
necessary leads to only a 2% degradation in performance, so the main
benefit of not using the `--enable-MPI' option is that `wcd' will
compile without the MPI libraries.
2.4.1.2 Pthreads
................
From version 0.3.0, wcd has a Pthreads version that allows parallelisation
on SMP architectures.
./configure --enable-PTHREADS
make
make install
In version 0.4, the two parallelisation techniques complement each
other. You can run on a cluster of SMP machines. However, an assumption
is made that each of the SMP machines has the same number of processors.
In version 0.3, binaries supporting both techniques could be built, but
only one technique used in a particular run; this restriction no longer
holds in 0.4.
2.4.2 Method 2: Manually using `make'
-------------------------------------
The `wcd' program is written in reasonably standard C. If you don't
want to or can't use the autoconf method described above, try the
following. Change directory to the `src' directory.
make -f WcdMake wcd
This was tested on different versions of RedHat and Suse and worked
without problem. `wcd' is written in reasonably standard C, and so
should work on most systems, but unfortunately the vagaries of different
libraries and compilers might make the standard compilation fail. See
the troubleshooting section below for some suggestions.
To get documentation, running `make -f WcdMake pdf' in the `src'
directory will produce a pdf version of the manual (placing output in
the `doc' directory). `make -f WcdMake info' will produce texinfo
version.
2.4.2.1 Trouble shooting
........................
If you know something about your system and are happy mucking about
with make files, please read the technical note below and make the
necessary changes. If not, try the following suggestions. There are
three likely causes of problems: (1) `wcd' supports long options (e.g.
you can say `--cluster_compare' rather than `-D'); (2) your compiler is
`cc' rather than `gcc'; and (3) your compiler does not support
procedure inlining (inlining may give some performance benefit, but
does not change functionality or use). The makefile `WcdMake' is the
one to look at.
1. Try `make -f WcdMake simple' to see whether the problem is that you
don't have the libraries that support long options. If this works,
`wcd' will run as is, except that you cannot use the long options
form for `wcd', only the short-form options.
2. If your compiler is named `cc' rather than `gcc', try saying `make
-f WcdMake ccsimple'.
3. If that doesn't work, try switching off inlining as well by doing
`make -f WcdMake reallysimple'. In this case neither long options nor
inlining is used. If your compiler is `cc' then try `make -f WcdMake
ccreallysimple'.
`wcd' was tested on a number of different systems. The table below
shows what worked on those systems. With a little tweaking of the
makefile to specify where _gcc_ libraries on your machine are, you
might get the standard `make wcd' to work too.
_Linux RedHat and Suse_
make wcd
_Sun OS 5.10, Darwin, Aix, FreeBSD_
make simple
_Irix64, Alpha/OSF1 Tru64, Cray_
make ccreallysimple
With the FreeBSD machine I managed to get the full version to work by
finding the appropriate libraries and includes. The same thing probably
applies to the others.
*This section need only be read if `make wcd' did not work and you
would like to have long options working.*
The non-standard options of `wcd' are to use inlining and the long
options library. If your system does not support inlining then the
compiler must be given the `-DNOINLINE' flag when run. If your system
does not provide a library for getting long options then the compiler
must be given the `-DNOLONGOPT' flag.
If, however, your system does provide a long options library and you
would like to use long options, then you must say where the include file
(`getopt.h') is and where the library is. You may also have to
explicitly tell the compiler to use the `libtextlib' library. Find out
where these things are, and change the variables in the `WcdMake' to
the appropriate values and then run `make try'. For example, if the
include file is in `/usr/share/contrib/gcc' and the library is in
`/usr/local/lib' you might have
INCLUDES = /usr/share/contrib/gcc
LIBS = /usr/local/lib
EXTRAFLAGS = gettextlib
You might or might not have the EXTRAFLAGS variable defined.
The makefile assumes that the compiler available is `gcc'. If it's
not or if you wish to use a different compiler, then edit the file
`Makefile' and change the definition of the variable `CC' to the
appropriate compiler.
If this doesn't work, then it may be that your libraries are in
non-standard places. Look at the Makefile and try the following things
* Add `-DNOLONGOPT' to the CFLAGS variable. In this case, you can't
use the long form options (e.g. you must say `-g' rather than
`--histogram')
* Try replacing `gcc' by `cc'.
* Add to the CFLAGS the directories where the include and libraries
can be found.
`wcd''s MPI and PTHREADS code are optional. In the code there are
macros that protect all references to MPI and PTHREADS. Unless these
macros are defined the compiler won't know about them. This enables
`wcd' to be compiled without having either libraries.
If you are making manually rather than through configure, and you
want to use MPI or PTHREADS (either or both), then you must make sure
that the MPI and/or PTHREADS macros are defined. You could
either put the relevant defines in the `wcd.h' file, or pass them to
`gcc' with something like `-D MPI'.
2.4.3 EMBOSS
------------
EMBOSS wrappers are available from the same source as the `wcd' code
itself. However, they do not currently support the new features in 0.6.
2.4.4 Auxiliary programs
------------------------
A number of auxiliary programs are included. They may be useful but are
not essential. They are written in either Perl or Python.
They should run as is, but if your version of `perl' or `python' is
not in a standard place, edit the files and change their first lines to
replace the line that says
#!/usr/bin/perl
with
#!/usr/local/bin/perl
or whatever.
In addition to these programs, there is a shell script
`wcd_wrapper.sh' that allows `wcd' to be used as a replacement for
`d2_cluster' in the `stackPACK' analysis pipeline.
2.5 Documentation
=================
This info file can be read by saying
info -f wcd.info
To create the documentation say `make pdf', `make html', etc. depending on
what sort of document you want. This creates the following:
* `wcd.info': the info file which can be read using the texinfo
system. This is probably easiest for reference purposes. You can
read the on-line version using the `info' system. If the
documentation is installed in the right place on your system you
can just say `info wcd', otherwise you will have to say `info -f
wcd.info', or give the full path.
* `html': a directory that contains all this help information in
HTML files. Use your browser to read it.
* `wcd.pdf': A PDF version of this help file, which is probably the
best for a hard copy.
You can play around with `makeinfo' etc. if you want different versions.
2.6 Using the Amazon Elastic Computing Cloud and S3
===================================================
`wcd' has successfully been used on the Amazon Elastic Computing Cloud.
To learn more, please go to `www.amazonaws.com' for an overview of EC2
and my paper _Scientific computing using virtual high-performance
computing: a case study using the Amazon elastic computing cloud_
published in _Proceedings of the 2008 Annual Research Conference of the
South African Institute of Computer Scientists and Information
Technologists_, October 2008
`http://doi.acm.org/10.1145/1456659.1456671'.
This section assumes you are a registered EC2 user and know how to
use it.
2.6.1 Pre-packaged AMI
----------------------
Public prepackaged images are available from Amazon AWS. The AMI's
manifest name will be `wcdimages/wcd-express-6.x.y' so please search
for manifests with `wcd-express' in them.
The AMI is based on the standard Amazon Linux AMI. In addition to the
packages that are part of that, gcc, gdb, and emacs are installed as
well as the mkESA, libfid and wcd code.
The default user for this AMI is called `ec2-user'.
The `wcd' binary and ancillary scripts can be found in
`/usr/local/bin' and all the documentation can be found in
`/usr/local/doc'. In the `/home/ec2-user/Data' directory are two sample
directories.
Using the standard wcd algorithm, check all is well by saying
wcd -g ~/Data/benchmark1000.fa
Check that the KABOOM algorithm is working by saying
prep-kabm ~/Data/benchmark10000.fa -g
2.7 Guide to using MPI
======================
Note that the KABOOM algorithm is not parallelised yet. However, if you
choose not to use KABOOM, you can use MPI. This section gives my
suggestions for running and configuring MPI. MPI generally is easy to
install and get to work. However, having tried to use wcd in a number
of environments and tried to help a number of wcd users, I have
encountered very severe problems caused by machines on which multiple
MPI implementations have been installed. Sometimes this could be
completely different versions of MPI (e.g. LAMMPI and MPICH2), and
sometimes just different versions of the same MPI (e.g.
openmpi 1.2.3 and openmpi 1.2.8). What can happen is that the
compiler mpicc is a slightly different (or even very different) version
to mpirun, or to some of the MPI libraries.
This is a really horrible problem. Often the problem does not
manifest itself clearly. There may not be a problem when wcd is run on
one machine only. For small data sets there may be few or no problems,
but as the data set size and the number of machines grow, problems do
occur. I find that the problem occurs particularly when automatic
package managers like apt-get or fink are used.
If you are sure that you are only ever going to have one version of MPI
on your system, and you are never going to upgrade it, then ignore
these ramblings. I am sure there are much cleverer and better ways of
doing things, but the following works for me as a clean approach, so
that if I add or change things I don't break previous
assumptions. (Or use the `modules' or `env-switcher' packages!)
First, I make sure that any existing MPI library or software is
deleted from any obvious places, including /usr/lib, /usr/bin,
/usr/local/bin, /usr/local/lib, /sw/bin, etc. If necessary do something
like
`find / -name ``*mpi*'' -print'
Then, create a special directory for each MPI installation you want.
I use either /opt or /usr/local for this. e.g. /usr/local/mpich2
Install the MPI version in this directory. If your package manager
can't do this, then get the source files and compile from scratch. When
you run configure you will say something like
`./configure --prefix=/usr/local/mpich2'
All the library and binary files will be put into this directory. You
must now add the lib and bin directories to your path
setenv PATH /usr/local/mpich2/bin:${PATH}
setenv LD_LIBRARY_PATH /usr/local/mpich2/lib:${LD_LIBRARY_PATH}
If you install a new version or even just upgrade, put the new version
in a new directory and change your paths.
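The `setenv' lines above are csh syntax. For bash or POSIX-shell users,
the equivalent export lines look like this; `/usr/local/mpich2' is just
the example prefix used above, so substitute whatever prefix you gave to
configure:

```shell
# bash/POSIX-shell equivalents of the csh setenv lines above;
# /usr/local/mpich2 is the example prefix, substitute your own.
MPIROOT=${MPIROOT:-/usr/local/mpich2}
export PATH="$MPIROOT/bin:$PATH"
# Append the previous LD_LIBRARY_PATH only if it was already set.
export LD_LIBRARY_PATH="$MPIROOT/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
# Sanity check: the first PATH entry should now be the MPI bin directory.
echo "$PATH" | tr ':' '\n' | head -n 1
```

Running `which mpicc mpirun' afterwards is a quick way to confirm that
both resolve to the same installation.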
File: wcd.info, Node: Running wcd, Next: wcd inside stackPACK, Prev: Installing wcd, Up: Top
3 Running wcd
*************
`wcd' can be invoked in different ways. The first just shows the usage;
the others are functional. They take an input file in FASTA format and
process it in some way. `wcd' numbers the sequences in the file from 0
upwards, and these indices are used to refer to the sequences.
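Since `wcd' reports clusters using these 0-based indices, a small helper
that maps an index back to its FASTA header can be handy when reading
the output. This is a generic sketch, not a script that wcd ships:

```shell
# A generic helper (not part of wcd): print "index<TAB>header" pairs,
# numbering FASTA headers from 0 the same way wcd numbers sequences.
fasta_index() {
    grep '^>' "$1" | awk '{ sub(/^>/, ""); printf "%d\t%s\n", NR - 1, $0 }'
}
```

For example, `fasta_index data/5000.seq | sed -n '8p'' prints the header
of the sequence that wcd calls sequence 7.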
Options or switches to `wcd' can be given in short form (e.g. `-D')
or long form (e.g. `--cluster_compare'). On some systems, only the
short form options are supported.
3.1 More detail
===============
3.2 Preparing your data: Garbage in, garbage out
=================================================
The value of clustering depends very strongly on the quality of the
input data. In particular, the presence of vectors, repeats and
significant low complexity data will compromise the quality of your
clustering. For this reason it is strongly recommended that you put
effort into cleaning your data and doing masking before you cluster.
3.3 Examples -- A Quick Introduction to `wcd'
=============================================
This section shows common ways in which `wcd' is likely to be invoked.
The following examples show straightforward clustering examples.
> `wcd --show_clusters data/5000.seq'
Cluster the sequences found in the file `data/5000.seq'. Print
the clusters on standard output in compact form. Use the d2
function to determine cluster membership.
> `wcd --histogram --show_clusters data/5000.seq'
As above, but also print a histogram that shows the size of the
clusters found.
> `wcd --output ans/5000.ans --histogram --show_clusters data/5000.seq '
> `wcd -o ans/5000.ans -g -c data/5000.seq '
As above, but save the output in a file.
> `wcd -c -N 5 data/5000.seq '
_If the `wcd' has been installed with the PTHREADS option_. Run
`wcd' on 5 processors at the same time.
> `mpirun -np 16 wcd -c data/5000.seq '
_If the `wcd' has been installed with the MPI option._ Run `wcd'
on 16 processors using the MPI libraries.
> `wcd -X -c data/5000.seq '
Cluster, but also use clone information. If two ESTs come from the
same clone, they'll also be put together. The clone information
comes from the FASTA file directly - it's the symbol that
follows the word "clone" in the header. This is a convenient
option, but for larger files it would be better to put this
information in a constraints file.
> `wcd --kabm -U 16 -c data_all.fa'
Cluster using the KABOOM clustering algorithm to speed things up. This
assumes that the data file contains the sequences and the reverse
complement and that the suffix array is also available.
3.4 Parameters
==============
`wcd' like most tools has many parameters that can be used in
controlling the clustering. Broadly they can be divided into two
classes: parameters that control a biological model of what we want
clustered together; and heuristic parameters that we can use to
trade-off speed and quality. `wcd' comes with sensible defaults, but
experience shows that the choice of parameters really matters and that
it is data set dependent. Our advice is that you need to cluster with
different sets of parameters and then compare the results that you get.
Below are the key parameters: the first two are the biological
parameters, the latter are the heuristic parameters. If you are using
the KABOOM algorithm, those heuristics are described later.
* Window length `-l', `--window_len'. `wcd' clusters sequences
together if they share regions that are approximately the same.
This parameter says how long these regions should be. The
default is 100: this means that two sequences will be clustered
if there are two regions of length at least 100 that are
approximately the same. The bigger `window_len' (all other things
being equal), the stricter the clustering.
* Clustering threshold `-T', `--theta'. This says how similar the
regions should be. For d2 clustering, a `theta' value of 0 would
be saying that two sequences should be clustered together if
they share regions (of length specified using `window_len') that
were exactly the same. The default value is `40'. For a window
length of 100, this corresponds to roughly a 96% similarity. For
a window length of 80, a `theta' score of 60 is roughly a 92%
similarity. However, there is no direct relationship between
`theta' score and similarity. d2 tends to punish differences
that are spread throughout two regions more harshly than
differences that are clustered together; d2 tends to punish repeats
more harshly than edit distance.
* Common word heuristic `-H', `--common_word'. This is a heuristic
used to speed up clustering. Sequence `i' is clustered with `j'
if a window of length `window_len' of sequence `j' contains at
least `H' words of length 8 that are in `i'. The choice of `H'
can help speed up clustering.
For a window length of 100, if `theta' is 40, we can see that
an `H' value of 72 is very safe to use. For a window length
of 80, if `theta' is 60, then an `H' value of 42 is safe to use
(provided there are few sequences that are shorter than the window
length, which is why for ultra-cautious clustering I might use an
even lower value).
* The `-K', `--sample_threshold' parameter. 4 is the default.
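The rough theta-to-similarity figures quoted above can be approximated
with a back-of-the-envelope calculation. The sketch below assumes a d2
word length of 6 and isolated substitutions, so that each mismatch
changes about 6 words per window and adds about 12 to the d2 score.
This is an illustration of the rough relationship only, not wcd's
actual scoring code, and it only approximates the 96% and 92% figures
given above:

```shell
# Back-of-the-envelope estimate (not wcd's actual scoring): assume a
# d2 word length of 6 and isolated substitutions, so each mismatch
# changes ~6 words per window and adds ~12 to the d2 score.
approx_similarity() {
    awk -v theta="$1" -v win="$2" 'BEGIN {
        mismatches = theta / 12.0
        printf "%.2f\n", 100.0 * (win - mismatches) / win
    }'
}
approx_similarity 40 100   # prints 96.67, close to the ~96% quoted
approx_similarity 60 80    # prints 93.75, close to the ~92% quoted
```

As the text notes, there is no direct relationship between `theta' and
similarity, so treat these numbers as ballpark figures only.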
Cookbook suggestions
--------------------
Below are some suggestions for parameters to use for different
sequences. However, again, I emphasise the importance of looking at your
data. Do you want short regions of very high similarity, or longish
regions of good but not necessarily excellent similarity? How
error-prone is your data? How long are the reads? It is only by
understanding your data that you can make sensible choices. Moreover, I
think you need to cluster with different sets of parameters and see
what you get. If you find that you get roughly similar results for a
range of different parameters then you can have confidence in your
data.
Our recommendation is that you first cluster using relatively
aggressive heuristic parameters (e.g. `-H') with a range of different
biological parameters (e.g. `--window_len' and `--theta'). Once you have
explored the effect of the biological parameters, you can choose one
and recluster with more conservative heuristic parameters that will
take a longer time to process.
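The explore-then-refine workflow described above might be scripted like
this. The file names and the theta grid are illustrative only, and the
script prints the commands (a dry run) rather than executing them:

```shell
#!/bin/sh
# Hypothetical parameter sweep: aggressive heuristic (-H 70) first,
# across a small grid of theta values...
for T in 30 40 50; do
    echo "wcd -l 100 -T $T -H 70 -o ans/sweep-T$T.ans --show_clusters ests.fa"
done
# ...then one conservative re-run with the chosen theta.
echo "wcd -l 100 -T 40 -H 62 -o ans/final.ans --show_clusters ests.fa"
```

Remove the `echo's to run the sweep for real, then compare the `.ans'
files to see how sensitive your data is to the choice of theta.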
* For Sanger style sequencing the following parameters are suggested
`-l 100 -T 40 -H 70'
This would be for good quality data with an average sequence length
well above 300. If your data quality is not so good, you could
try relaxing the T value up to 60. An H value of 70 is
conservative but not ultra-conservative. You might get
slightly better quality by choosing an H of 62, though
clustering will take a little longer.
* For 454 data (average sequence length of above 150), the following
has worked well