A Snakemake pipeline for third-generation sequence assembly, designed to automate the processing and assembly of third-generation sequencing data through quality control, assembly, polishing, and post-processing.

| Item | Details |
|---|---|
| Name | Third generation sequence assembly |
| Date | 2023-04-24 |
| Description | Snakemake pipeline for automated assembly of third-generation sequencing data, including quality control, de novo assembly, multi-round polishing, and genome optimization. |
| Author | Tong Xu |


The pipeline relies on the following tools (specific versions recommended):

- `fastp = 0.23.2` (short-read quality control)
- `flye = 2.9.2` (third-generation sequence de novo assembly)
- `minimap2 = 2.24` (sequence alignment)
- `samtools = 1.17` (BAM file processing)
- `pilon = 1.24` (genome polishing)
- `prodigal` (ORF prediction for post-processing)


Input data must follow this structure (relative to `base_dir` in the script):

- `SG_reads/{sample}/`: Short-read (second-generation) data, named as `{sample}_1.fq` (R1) and `{sample}_2.fq` (R2).
- `TG_reads/{sample}/`: Third-generation sequencing data, named as `{sample}.filtered_reads.fq`.

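
Before launching a run, it can help to confirm that a sample's files match the layout above. The following is a minimal sketch, not part of the pipeline itself; the function names `expected_inputs` and `missing_inputs` are hypothetical, and only the directory and file naming comes from the structure described here.

```python
from pathlib import Path


def expected_inputs(base_dir, sample):
    """Return the three input files the layout implies for one sample."""
    base = Path(base_dir)
    return [
        base / "SG_reads" / sample / f"{sample}_1.fq",  # short reads, R1
        base / "SG_reads" / sample / f"{sample}_2.fq",  # short reads, R2
        base / "TG_reads" / sample / f"{sample}.filtered_reads.fq",  # long reads
    ]


def missing_inputs(base_dir, sample):
    """List the expected input files that do not exist on disk."""
    return [path for path in expected_inputs(base_dir, sample) if not path.is_file()]
```

Calling `missing_inputs(base_dir, "S1")` returns an empty list when all three files for sample `S1` are in place.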

Automatically generated output directories:

- `fastp_output/`: Quality-controlled short reads and fastp reports (JSON/HTML).
- `flye_output/`: Initial assembly results from Flye.
- `minimap2_flye_output/`: Alignment files (BAM) generated by minimap2.
- `pilon_flye_output/`: Polished genomes after each Pilon round.
- `TG_unfixed_genome/`: Renamed contigs before final optimization.
- `prodigal_output/`: ORF prediction results (GFF format) from Prodigal.
- `TG_genome/`: Final optimized genome assemblies.

The pipeline runs in the following sequential steps via Snakemake rules:
- Quality Control with fastp: Trims low-quality bases and adapters, and removes reads that are too short from the short-read data. Outputs cleaned reads and quality reports (JSON/HTML).
- De novo Assembly with Flye: Assembles third-generation reads into initial contigs. Key parameters:
  - Genome size: ~4.5m (optimized for *Phaeobacter gallaeciensis*; adjust based on target species).
  - Coverage limit: 300x (prevents assembly failure due to excessive depth).
- Multi-round Genome Polishing:
  - minimap2: Aligns cleaned short reads to the assembled genome, generating sorted BAM files.
  - Pilon (3 rounds): Polishes the genome using short-read alignments to correct SNPs and indels, with a minimum depth threshold of 10 for reliable corrections.
- Contig Renaming: Renames contigs to standardized identifiers (e.g., `{sample}_chromosome_1`, `{sample}_plasmid_1`) based on assembly metrics (length, coverage, circularity from `assembly_info.txt`).
- ORF Prediction with Prodigal: Predicts open reading frames (ORFs) from renamed contigs to guide genome optimization.
- Circular Contig Fixing: Adjusts circular contigs to repair potential broken genes by reorienting sequences based on ORF positions (moves non-ORF regions to the end of contigs).
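
The contig renaming step can be illustrated with a short sketch. It reads Flye's tab-separated `assembly_info.txt` (columns `seq_name`, `length`, `cov.`, `circ.`, ...) and assigns the standardized names. The classification rule is an assumption made for illustration: contigs at least `min_chromosome_len` bases long are called chromosomes and the rest plasmids. The actual pipeline may weigh coverage and circularity differently, and the function names are hypothetical.

```python
def parse_assembly_info(text):
    """Parse the contents of Flye's assembly_info.txt into a list of dicts."""
    contigs = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header row
        fields = line.split("\t")
        contigs.append({
            "name": fields[0],
            "length": int(fields[1]),
            "coverage": float(fields[2]),
            "circular": fields[3] == "Y",
        })
    return contigs


def rename_contigs(contigs, sample, min_chromosome_len=1_000_000):
    """Map original contig names to {sample}_chromosome_N or {sample}_plasmid_N.

    Assumption: length alone decides the class; the threshold is illustrative.
    """
    counts = {"chromosome": 0, "plasmid": 0}
    mapping = {}
    # Number contigs within each class from longest to shortest.
    for contig in sorted(contigs, key=lambda c: c["length"], reverse=True):
        kind = "chromosome" if contig["length"] >= min_chromosome_len else "plasmid"
        counts[kind] += 1
        mapping[contig["name"]] = f"{sample}_{kind}_{counts[kind]}"
    return mapping
```

The parsed `coverage` and `circular` fields are kept in the records so that a stricter rule (for example, requiring plasmids to be circular) could be added without changing the parser.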

To use the pipeline:

- Prepare Input Data: Organize short-read and third-generation data into the required directory structure.
- Configure Paths: Ensure `base_dir` (root directory) and `pilon_jar` (path to Pilon JAR file) in the script are correctly set.
- Run the Pipeline: Execute with Snakemake, specifying the number of cores:

  ```bash
  snakemake --cores [number_of_cores]
  ```

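
To make the role of `base_dir` and `pilon_jar` concrete, here is a rough sketch of the commands that three chained polishing rounds might run. It is not the pipeline's actual rule code. The function name and the file names inside each output directory are hypothetical; the directory names, the three rounds, and `--mindepth 10` follow this document, and the minimap2, samtools, and Pilon options shown are standard ones.

```python
from pathlib import Path


def polishing_commands(base_dir, pilon_jar, sample, assembly, rounds=3, threads=8):
    """Return (commands, final_genome) for chained short-read polishing rounds."""
    base = Path(base_dir)
    # Hypothetical names for the fastp-cleaned reads.
    r1 = base / "fastp_output" / f"{sample}_1.clean.fq"
    r2 = base / "fastp_output" / f"{sample}_2.clean.fq"
    outdir = base / "pilon_flye_output" / sample
    genome = Path(assembly)
    commands = []
    for n in range(1, rounds + 1):
        bam = base / "minimap2_flye_output" / f"{sample}.round{n}.sorted.bam"
        prefix = f"{sample}_pilon{n}"
        # Align short reads to the current genome and sort the alignments.
        commands.append(
            f"minimap2 -ax sr -t {threads} {genome} {r1} {r2} "
            f"| samtools sort -@ {threads} -o {bam} -"
        )
        commands.append(f"samtools index {bam}")
        # Polish with Pilon using the minimum depth threshold of 10.
        commands.append(
            f"java -jar {pilon_jar} --genome {genome} --frags {bam} "
            f"--output {prefix} --outdir {outdir} --mindepth 10"
        )
        # Pilon writes <outdir>/<prefix>.fasta, which feeds the next round.
        genome = outdir / f"{prefix}.fasta"
    return commands, genome
```

Each round aligns against the output of the previous one, so corrections accumulate rather than being recomputed against the raw Flye assembly.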
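
Notes:

- Genome Size Adjustment: Modify the `genomeSize` parameter in the Flye rule for non-*Phaeobacter gallaeciensis* species.
- Polishing Stringency: Pilon uses `--mindepth 10` to filter low-coverage variants; adjust based on data quality.
- Circular Contigs: The final step fixes potential gene breaks in circular contigs, improving genome completeness for downstream analysis.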
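
The circular contig fix can be sketched as a rotation. This is one reading of "moves non-ORF regions to the end of contigs", not the pipeline's confirmed logic: the sequence before the first predicted ORF is cut off and appended to the end, so the contig begins at an ORF start. The function names are hypothetical, and the GFF parsing assumes Prodigal's standard nine-column output with 1-based coordinates.

```python
def first_orf_start(gff_text, contig_id):
    """Return the smallest 1-based CDS start on contig_id, or None if no CDS exists."""
    starts = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF comment/header lines
        fields = line.split("\t")
        if len(fields) < 9 or fields[0] != contig_id or fields[2] != "CDS":
            continue
        starts.append(int(fields[3]))
    return min(starts) if starts else None


def rotate_to_first_orf(sequence, orf_start):
    """Rotate a circular sequence so that it begins at orf_start (1-based)."""
    if orf_start is None or orf_start <= 1:
        return sequence  # nothing to move
    cut = orf_start - 1  # convert to a 0-based slice index
    # The leading non-ORF region is moved to the end of the contig.
    return sequence[cut:] + sequence[:cut]
```

Because the contig is circular, the rotation changes only the coordinate origin, not the molecule. A gene fragment that was split across the old start and end becomes contiguous near the new end of the sequence.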
to filter low-coverage variants; adjust based on data quality. - Circular Contigs: The final step fixes potential gene breaks in circular contigs, improving genome completeness for downstream analysis.