This repository contains demonstration data and results for the Sequeduct pipeline. Guidelines for interpreting the results are also provided here.
Example files and results (results_example
) can be browsed here or after downloading (cloning) the repository.
Certain intermediate analysis files in a few directories were not included (i.e. these directories are empty), due to the size of these generated files.
The pipeline can also be run on the provided data. After installation of the pipeline as described on the Sequeduct website, download (clone) the repository (click on the "<> Code" button at the top of this page) and run the below commands.
All results are output in a newly created results
directory.
nextflow run edinburgh-genome-foundry/Sequeduct -r v0.3.1 -entry preview \
--fastq_dir='fastq_pass' \
--sample_sheet='sample_sheet.csv' \
-profile docker
The Preview pipeline runs a NanoPlot analysis on the raw reads. This is useful for getting an overview of our sequencing data. The results are saved in the results/dir1_preview
directory.
The pipeline requires a sample sheet that lists the selected barcodes (directories in the FASTQ folders) that we want to analyse. This allows us to analyse a subset of the data, for example when not all barcodes are used, or when we have multiple projects in the same flowcell run. The sheet must come with the header line as shown in the example.
Please see NanoPlot documentation for details.
It's recommended to examine the plots in order to check the overall quality of the run. The histograms provide information on the number of reads obtained, the quality distribution and the potential presence of plasmid dimers. With this information, we can proceed with selected samples in the Analysis pipeline.
nextflow run edinburgh-genome-foundry/Sequeduct -r v0.3.1 -entry analysis \
--fastq_dir='fastq_pass' \
--reference_dir='genbank' \
--sample_sheet='sample_sheet.csv' \
--projectname='EGF demo' \
-profile docker
The Analysis pipeline compares the Nanopore reads against the expected (designed) sequence.
The pipeline requires the following input data: reference Genbank files (in standard format, and with extension ".gb"), the FASTQ file folder, and a sample sheet that maps reference filenames (without extension) with the subdirectories in the FASTQ folder. (EGF's Convert Sequence Files can convert FASTA, Genbank or other formats into the required standard format.) As the pipeline was designed to work with multiple barcodes, the FASTQ folder must be structured with subdirectories for each barcode.
The output files are saved in the results/dir2_analysis
directory. In this, n7_results
contain the final results: a PDF report, a list of the sample analysed, and a summary of the results in the CSV file format. The PDF report describes the results in detail, as shown below (see its Appendix for a detailed description). The sample list details the relevant result files for for each analysed sample entry (entries.csv). The results summary file can be opened in a spreadsheet software and revised based on the report, following the procedure described below.
The report is structured into chapters, one for each FASTQ subdirectory (i.e. barcode). Note that the reads are not required to be derived from barcoded samples, as the pipeline does not utilise any barcoding information. Therefore any set of FASTQ files can be used, whether they originated from non-barcoded sequencing, or from demultiplexed data from a custom barcoding protocol. Each chapter consists of one or more sections that each describe the results for a given reference (plasmid). This structure allows future expansion of the pipeline to multiplexed samples, but currently this is not implemented.
Report cover page:
The cover page contains basic information about the run, such as the number of reads analysed.
Chapter cover page:
The first line summarises the results for the FASTQ directory, followed by a table of key statistics. A histogram of filtered read lengths is displayed. If we have mostly full length reads (e.g. linearised plasmids or plasmids from rapid barcoding kit), then the peak can be used for estimating the size of the plasmid. Wrong size indicates large structural variation (insertion / deletion), sample mixup or errors in the sample spreadsheet.
Section (first page): coverage plot
Each reference DNA (plasmid) is analysed separately, with its corresponding reads. A result call (fail / pass / low coverage / warning) is assigned to each section. It's recommended to revise this, based on the contents of the section. A coverage plot is displayed under the reference sequence (plasmid) map. In this example of a DNA construct properly assembled from parts, we have ~500x coverage, evenly covering the full length of the sequence, therefore we can exclude large deletions. Note that to include tolerance and due to potential adapter sequences, unaligned read cutoff is set to 100 bp and shown with a grey vertical line.
The cumulative plot of longest unaligned intervals also suggests that there are no regions in the reads that don't align to the reference, suggesting that there are no large insertions either.
Finally, a simplified variant call format (VCF) table lists all detected small variants (SNPs and indels). Homopolymer stretches are known to produce systemic sequencing errors, therefore these are not considered true variants. This is confirmed by the disagreement between the reads: as seen in the reference / alternate allele observation counts (RO / AO columns).
We find one point mutation at position 1836.
Section (second page): variant plot
The second page shows the plasmid map with variant annotations, for an easy overview. (Variants at homopolymers are ignored as explained above.) A consensus file of true variants is also provided for each barcode. This consensus is provided in the FASTA format and is derived from the reference by applying the variant modifications.
The second example plasmid was flagged as a failed sample. The histogram of reads suggests that the plasmid is smaller than what we expect.
This was a failed plasmid assembly. We see that there is no coverage for feature_20, indicating that there is an assembly error. The insert plot shows that the majority of reads have a non-aligning segment of ~1700 bp. This suggests that some other DNA part got assembled into the plasmid, instead of feature_20. Further analysis and explanation is provided in the Review section below.
We also have the same point mutation present as in the previous example, however, note that variant call does not detect structural variants such as large deletions or insertions. In this case, the consensus FASTA sequence should not be used.
This construct used the same DNA part (feature_8) as the previous example, which suggests that the point mutation was present in the DNA part originally and is not created during the cloning and amplification work.
Please see the Appendix of the PDF report, the publication, and the Nextflow pipeline code and documentation for more details.
Key pipeline parameters can be set by the user. The list of parameters and their default settings can be found in the nextflow.config file. The default parameters work very well for most sequencing runs, but we provide a few suggestions below for setting different values, depending on use-case.
--max_len_fraction=1.5
: the maximum read length cutoff is used to filter reads by NanoFilt. The default is 1.5x of the reference sequence length. Set this to a higher value to include plasmid dimers, or if the reference sequence is a short subsegment of the sequenced DNA.
--min_length=500
: the value is in number of nucleotides (bp). Most plasmids are longer than 1 kbp and linearisation or fragmentation results in mostly full length reads. Set the value lower if the DNA is shorter or the reads are more fragmented.
--quality_cutoff=10
: this PHRED quality score cutoff parameter is used by NanoFilt. Set to a higher value to work with better reads or if there is an overabundance of reads. Set to a lower value to use more reads if the reads are lower quality than usual, e.g. due to an issue with a sequencing run or sequencer software. Results of the Preview pipeline can be used for determining an appropriate cutoff.
--freebayes.args
: these are used by the freebayes variant detector. Set --min-base-quality
PHRED score cutoff higher/lower to use fewer/more bases during variant call. The variant call table (DP column) in the report can help in determining a different cutoff. If the values in the DP column are very low in contrast to high sequencing depth or coverage, then try a lower value such as 15. In most cases the default will be the most suitable.
- Open a copy of
results.csv
in a spreadsheet software. - Create a new column and review the results for each sample in the PDF report.
- Check number of reads and coverage. Insufficient coverage is usually marked with 'low coverage'.
- On the histogram, check read length distribution (fragmentation). If we have sufficient number of full-length reads, then compare the peak against the expected length (vertical red line). Differences can indicate large indels.
- Inspect the coverage plot for uncovered regions, which indicate deletions.
- Interpret the cumulative plot of longest unaligned intervals to find large inserts.
- Review the variant call format table and note variants. Note that large indels are not displayed in the table.
- Reject or accept samples, depending on the requirements of the project.
- If a large insertion or deletion is found, perform a de novo assembly on the sample. For large insertions, the Review pipeline can help in clarifying the nature of the error.
nextflow run edinburgh-genome-foundry/Sequeduct -r v0.3.1 -entry review \
--reference_dir='genbank' \
--results_csv='results_reviewed.csv' \
--projectname='EGF demo review' \
--all_parts='parts_fasta/parts.fasta' \
--assembly_plan='demo_assembly_plan.csv' \
-profile docker
The Review pipeline aligns a user-defined list of sequences against a de novo plasmid sequence, and then reports the alignments. This is useful for evaluating plasmids that are constructed from parts to clarify whether we have part or sample mix-ups, recombination events or overhang misannealing.
The de novo consensus sequence is provided in the FASTA format and is assembled entirely from the reads, using Canu, without utilising the reference file.
This pipeline can only be run after running the Analysis pipeline, as it uses the generated files. It requires the reference Genbank files, the sequences in a single FASTA file, and a sheet specifying which samples we want to review. This sample sheet is the results.csv
file from the Analysis run, with requested samples marked with 1
in the Review_de_novo
column. (Other marker value can be set with the --denovo_true
parameter.) There is also a Review_consensus
column for specifying samples to be analysed using the variant call consensus sequences, but this is not recommended due to issues mentioned above. Optionally, an assembly plan can be specified, which lists which sequences we expect to be present for each reference sequence. This information is used in the report for an easier interpretation of results.
The results are saved in the results/dir3_review
directory. Please see the Appendix of the consensus review PDF report, or the de novo review PDF report for a description.
The report greets the user with a cover page:
This is followed by one chapter for each sample:
On this page, a plot of the aligning parts is displayed against the de novo assembled sequence. There are no feature annotations as the assembly was created from the reads, and the parts were supplied in unannotated FASTA format. The reference sequence is also aligned against the de novo assembly, shown in grey. The annotations are coloured based on the assembly plan. Green colour indicates expected parts, and red colour indicates unexpected parts.
We can see that the green parts cover the full region of the de novo sequence, and the grey reference also fully aligns, indicating that the assembly is correct. We note that the sequence assembled from reads do not (necessarily) have the same origin as our reference. (Circular sequences are stored in a "linear" format as a sequence of letters, and the first nucleotide is the origin of the sequence, as stored in the file.) The DNA parts were in a carrier backbone plasmid, and this was also supplied with the name 'part_carrier'. The ori region of this carrier plasmid is nearly identical to the ori of the backbone, 'HC_Amp_ccdB', therefore it aligns and shows on the plot. The software automatically recognises if the de novo assembly is in reverse complement to the reference, as noted on the top of the page.
The next chapter shows results for the failed assembly product:
We can see that the part carrier backbone, displayed in red, was assembled instead of the intended insert. Upon inspection of the corresponding de novo assembly FASTA, we can find the presence of KanR, a resistance marker of the carrier plasmid, and a recognition site for the restriction enzyme used for the assembly (BsmBI). A separate part carrier ori sequence alignment is also present, as discussed above.
nextflow run edinburgh-genome-foundry/Sequeduct -r v0.3.1 -entry assembly \
--fastq_dir='fastq_pass' \
--assembly_sheet='de_novo_assembly_sheet.csv' \
-profile docker
The standalone Assembly pipeline creates de novo assembly sequences, without any reference files. It requires the FASTQ files, and a sample sheet listing the barcodes and corresponding expected DNA (plasmid) length (in kbp). The results are saved in the results/dir4_assembly
directory.
Note that sometimes the assembled sequence is made up of two consecutive sequences of the reference (with double length). This "duplication" happens when reads are derived from random segments of a circular sequence, such as a plasmid, and joined by an assembler. Canu, the assembler, tries to identify whether the sequence is duplicated, but this automatic identification is not always successful. The result of the identification is shown in the suggestCircular
parameter of the consensus FASTA header (sequence name).
Biofoundry-scale DNA assembly validation using cost-effective high-throughput long-read sequencing, Peter Vegh, Sophie Donovan, Susan Rosser, Giovanni Stracquadanio, Rennos Fragkoudis. ACS Synthetic Biology (2024) 13, 2, 683–686
Please use the Sequeduct project's page to file any issues or comments. See the main page for other ways of contact.
Copyright 2023 Edinburgh Genome Foundry, University of Edinburgh