updates.txt

****nextpresso1.9, jan2017

1. I realised that seqtk consumes a lot of RAM when it is asked to downsample reads. That's why I included another parameter in the configuration XML file:
'maximunNumberOfInstancesForDownSampling'
It prevents a RAM overload at this step by limiting the number of instances that run simultaneously during downsampling, while still allowing the rest of the programs to run with a higher number of instances (as set by the parameter maximunNumberOfInstancesAllowedToRunSimultaneouslyInOneParticularStep).

On the other hand, based on a publication I've just read, it's better to allow the user to control additional parameters of the cufflinks package, as the default options for these parameters
would lead to worse results (have a look at the paper 'Errors in RNA-Seq quantification affect genes of relevance to human disease').

2. Creates a new file with cumulative variances when calculating correlation and PCA (for DESeq2, Cuffnorm and Cufflinks).
3. Creates a new rnk file from DESeq2 differential expression files.
4. Creates a new file with normalised counts for ALL samples together from DESeq2.
5. Also runs GSEA on DESeq2 rnk files. This leads to two new folders: 'GSEA_Cuffdiff', with GSEA results based on Cuffdiff rnk files, and 'GSEA_DESeq', with GSEA results based on DESeq2 rnk files.
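
Going back to item 1, the idea of a separate, lower instance limit for the downsampling step can be sketched as follows. This is only an illustration in Python, not the pipeline's own code: the two constants stand in for the configuration parameters, and run_step, PeakCounter and fake_downsample are hypothetical names invented for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed example values for the two configuration parameters.
MAX_INSTANCES_PER_STEP = 8      # maximunNumberOfInstancesAllowedToRunSimultaneouslyInOneParticularStep
MAX_INSTANCES_DOWNSAMPLING = 2  # maximunNumberOfInstancesForDownSampling


def run_step(job, samples, max_instances):
    """Run job(sample) for every sample, with at most max_instances at once."""
    with ThreadPoolExecutor(max_workers=max_instances) as pool:
        return list(pool.map(job, samples))


class PeakCounter:
    """Records the highest number of jobs seen running at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.running = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.running += 1
            self.peak = max(self.peak, self.running)

    def __exit__(self, *exc):
        with self._lock:
            self.running -= 1


downsampling_load = PeakCounter()


def fake_downsample(sample):
    """Placeholder for a memory-hungry downsampling job."""
    with downsampling_load:
        time.sleep(0.01)
    return sample + ".downsampled"


samples = [f"sample{i}" for i in range(6)]
# The memory-hungry step uses the lower limit ...
downsampled = run_step(fake_downsample, samples, MAX_INSTANCES_DOWNSAMPLING)
# ... while any other step can still use the general, higher limit.
other_step = run_step(str.upper, samples, MAX_INSTANCES_PER_STEP)
```

With this layout the peak number of simultaneous downsampling jobs never exceeds the lower limit, whatever value the general limit has.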


****nextpresso1.8, may2016

A new check has been added to execution level 0. It now performs checksum validation for all the samples using md5sum, in the following way:
1. Checks if the same checksum value is shared by two different samples
	(i.e. two different sample files, with different names but the same content, generated by mistake)
2. If checksum values were provided by the sequencing facility, they can be validated against the samples by providing this file to nextpresso
	(check the experiment.xml file, a new attribute has been added to the 'experiment' xml element, named: 'fileWithChecksumCodesToValidate')
	(if you don't provide this file, only step 1 is performed)
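
The two checks above can be sketched roughly as follows. This is an illustration in Python, not the code used by nextpresso: the function names are hypothetical, and because the layout of the file given in 'fileWithChecksumCodesToValidate' is not described here, the expected checksums are simply passed in as a dictionary.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicated_samples(checksums):
    """Check 1: groups of sample names that share the same checksum."""
    by_md5 = defaultdict(list)
    for name, md5 in checksums.items():
        by_md5[md5].append(name)
    return [sorted(names) for names in by_md5.values() if len(names) > 1]


def find_mismatches(checksums, expected):
    """Check 2: samples whose checksum differs from the expected one.

    Samples without an expected value are skipped.
    """
    return sorted(
        name
        for name, md5 in checksums.items()
        if name in expected and expected[name] != md5
    )


# Small demonstration with made-up sample files.
contents = {
    "a.fastq": b"@r1\nACGT\n+\nFFFF\n",
    "b.fastq": b"@r1\nACGT\n+\nFFFF\n",  # same content as a.fastq, by mistake
    "c.fastq": b"@r2\nTTTT\n+\nFFFF\n",
}
checksums = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, data in contents.items():
        path = os.path.join(tmp, name)
        with open(path, "wb") as handle:
            handle.write(data)
        checksums[name] = md5_of_file(path)

duplicates = find_duplicated_samples(checksums)
mismatches = find_mismatches(checksums, {"c.fastq": "0" * 32})
```

Here 'duplicates' reports a.fastq and b.fastq as one group, and 'mismatches' reports c.fastq because the made-up expected value does not match.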


Additionally, a bug in RNAseq.pl was fixed in execution level 2: when none of the samples requires trimming, the additional quality check is now skipped.


****nextpresso1.7, apr2016

Several Perl packages from CPAN that are required by nextpresso are now included in the 'Utils' directory, so they no longer need to be installed separately.
(Suggested by Elena Pineiro, thanks Elena !!!)

The packages that are now included are:

Excel 
File 
GD 
Tree 
XML


****nextpresso1.6, jan2016

When creating the ranked file (rnk) for GSEA, all FPKM values are transformed as follows: FPKM=original_FPKM+1. This transformation affects both FPKM1 and FPKM2.
It tries to avoid cases like the following:
test_id	gene_id	gene	locus	sample_1	sample_2	status	FPKM_1	FPKM_2	log2(fold_change)	test_stat	p_value	q_value	significant
ARHGAP19-SLIT1	ARHGAP19-SLIT1	ARHGAP19-SLIT1	chr10:98757794-99052430	Control	Treatment	NOTEST	1.10925E-42	5.20103E-136	-310.032	0	1	1	no
where two real but very low FPKMs give rise to an unrealistic, extremely negative log2(fold-change) that GSEA would place at the edge of one of the phenotypes, which is an artifact.

No other filtering is considered (like the one used in version 1.5 for cases with FPKM=0 in both conditions).
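
The effect of the +1 transformation on the example above can be checked with a few lines of Python. This is only a sketch: it assumes the fold change is taken as log2(FPKM_2/FPKM_1), as in the cuffdiff output shown above, and log2_fold_change is a hypothetical helper name.

```python
import math


def log2_fold_change(fpkm_1, fpkm_2, offset=1.0):
    """log2 ratio of the two FPKMs after adding the same offset to both."""
    return math.log2((fpkm_2 + offset) / (fpkm_1 + offset))


# FPKM values of ARHGAP19-SLIT1 from the example line above.
fpkm_1 = 1.10925e-42
fpkm_2 = 5.20103e-136

raw = math.log2(fpkm_2 / fpkm_1)             # about -310.03, as reported by cuffdiff
shifted = log2_fold_change(fpkm_1, fpkm_2)   # 0.0: both values are effectively 1 after the shift
```

After the shift, the gene no longer lands at the extreme end of the ranking.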


****nextpresso1.5, oct2015

When creating the ranked file (rnk) for GSEA, it filters out all the genes with FPKM=0 in both conditions (FPKM1 & FPKM2).
Why?
We found cases where a gene set is composed of a huge number of genes, and where FPKM1=0 and FPKM2=0 for all the genes in the gene set. This causes GSEA to return a significant result for this gene set, with all of its genes appearing right in the middle of the GSEA image, which makes no sense.
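
A minimal Python sketch of this filter is given below. It is an illustration only: the column names follow the cuffdiff output format, while drop_unexpressed and the gene names are made up for the example.

```python
def drop_unexpressed(rows):
    """Keep only the genes that have a non-zero FPKM in at least one condition."""
    return [
        row for row in rows
        if not (row["FPKM_1"] == 0 and row["FPKM_2"] == 0)
    ]


# Made-up rows for illustration.
rows = [
    {"gene": "GENE_A", "FPKM_1": 0.0, "FPKM_2": 0.0},   # removed
    {"gene": "GENE_B", "FPKM_1": 0.0, "FPKM_2": 3.2},   # kept
    {"gene": "GENE_C", "FPKM_1": 12.5, "FPKM_2": 8.1},  # kept
]
kept_genes = [row["gene"] for row in drop_unexpressed(rows)]
```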


****nextpresso1.4, aug2015

RNAseq pipeline renamed as nextpresso!!!

By the way, a bug related to the execution of HTseq-count with paired-end reads is now fixed (thanks to Javier Perales, good catch!!)


****RNAseq_pipelinev1.4 release, jun2015

1. Lines with FPKM=0 in one of the FPKM values get '0.001' added to both FPKMs (FPKM1 and FPKM2).
2. Cases where log2FC=0 (i.e. log2(1/1)) are explicitly unsorted to avoid artificial images (like the vertical block in the middle of the GSEA image) caused by sorting by gene name.
3. The runSamtools() method in htseqCount.pm has been modified to avoid lines in the sam file like the one below. These lines with '*' in the 3rd column cause htseqcount to break and stop execution (there are very few of them).
HWI-D00689_0078:7:1106:8961:27289#29222_GCCAAT	272	*	156	1	29M	*	0	0	TTAAAATGAACCTGCCGGCTGATCGTTTT	FFFFFFFFFFFFFFFFFFFFFFFFFFFFF	AS:i:-5	XM:i:1	XO:i:0	XG:i:0	MD:Z:38C11	NM:i:1	XF:Z:1 ERCC-00096-chr5 156 29M50643494F21m TTAAAATGAACCTGCCGGCTGATCGTTTTTTTTAGGATATTGTGAGTAAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBBB	NH:i:3	CC:Z:=	CP:i:156	XS:A:+	HI:i:0
4. Modification for cuffdiff and htseqcount when having spikes:
When having spikes, it is more appropriate to not consider them for cuffdiff and htseqcount as they could affect normalization values for regular genes. So in this case, the original GTF is given instead of the one with the combined annotation (genes+spikes).
The 'originalGTF' variable was introduced to aid with this.
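
As an illustration of item 3, the kind of filter involved can be sketched in Python as shown below. How runSamtools() actually does it is not described here, so this is only an assumption about the logic; keep_sam_line and the shortened SAM lines are made up for the example.

```python
def keep_sam_line(line):
    """Return False for alignment lines whose 3rd column (reference name) is '*'."""
    if line.startswith("@"):
        # Header lines are always kept.
        return True
    fields = line.rstrip("\n").split("\t")
    return len(fields) > 2 and fields[2] != "*"


# Shortened, made-up SAM lines for illustration.
sam_lines = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "read1\t0\tchr1\t100\t50\t29M\t*\t0\t0\tACGT\tFFFF\n",
    "read2\t272\t*\t156\t1\t29M\t*\t0\t0\tACGT\tFFFF\n",  # dropped: '*' in column 3
]
kept = [line for line in sam_lines if keep_sam_line(line)]
```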


****RNAseq_pipelinev1.3 release, apr2015

1. Adds DESeq2 execution: it implies a new modification in the experiment XML file.
The configuration XML file does not require a modification, because DESeq is added
as a new library that must be installed in R.


****RNAseq_pipelinev1.2 release, apr2015

1. Attribute 'useCuffmergeAssembly' from '<cuffmerge nThreads="4" useCuffmergeAssembly="false">'
has been moved to the cuffdiff element. This option controls whether cuffdiff uses
the GTF assembly generated by cuffmerge instead of the original one, so it makes more sense
as an attribute of the cuffdiff element than of the cuffmerge element in the xml file.

2. Attribute 'numberOfThreads' in configuration.xml file has been renamed to 'maximunNumberOfInstancesAllowedToRunSimultaneouslyInOneParticularStep'.
This tremendously long attribute name should make the parameter much easier to understand (I hope....)
Before, it could be confused with the number of threads used, for example, by Tophat or Cufflinks (-p parameter)

3. Temporary directories created during execution (in /tmp/) are now named with the date and time added, to avoid collisions

4. Correlation tests, PCAs, and gct files are now created also for cuffnorm output files

5. The grep command shown below, included in cufflinks.pm in pipeline version v1.1,
to filter out cases in the reference/annotation like chr6_apd_hap1:174179-195170,
has been excluded here.
	grep -E 'chr[12]?[0-9XYxyMm]:' ...
Reason: it didn't allow the execution of cases where the reference/annotation names are not like 'chr1', for example.
This happens when using particular references like scaffolds.
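
The behaviour of the pattern in item 5 can be reproduced with the short Python check below. It is only a sketch: Python's re.search is used here as a stand-in for 'grep -E', and the scaffold name is a made-up example.

```python
import re

# Same pattern as in the removed grep command.
PATTERN = re.compile(r"chr[12]?[0-9XYxyMm]:")

loci = [
    "chr1:100-200",
    "chr22:5-50",
    "chrX:1-10",
    "chr6_apd_hap1:174179-195170",  # the case the filter was meant to drop
    "scaffold_12:1-500",            # hypothetical scaffold name, also dropped
]
kept = [locus for locus in loci if PATTERN.search(locus)]
```

Only the first three loci pass the pattern, which is why references that do not use 'chr'-style names ended up with nothing left to process.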


****RNAseq_pipelinev1.1 release, feb2015

1. The element <queueProject>MyProject</queueProject> from configuration.xml has been replaced by the 'projectName' attribute in experiment.xml.
It is used when running the pipeline in cluster queues, to get access to the allowed execution projects.
Both xml schemas (xsd) were modified accordingly.

2. '--max-bundle-frags' parameter has been added to cufflinks, cuffquant and cuffdiff.
You can modify its value through the attribute 'maxBundleFrags' in the experiment xml. Default value=500000.

3. It includes a new grep command in cufflinks.pm to filter out
cases in the reference/annotation like chr6_apd_hap1:174179-195170
grep -E 'chr[12]?[0-9XYxyMm]:' ...