Skip to content

Commit e30a93d

Browse files
committed
Initial commit
0 parents  commit e30a93d

File tree

5 files changed

+847
-0
lines changed

5 files changed

+847
-0
lines changed

README.md

Lines changed: 71 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,71 @@
1+
# FunDiS Pipeline
2+
3+
FunDiS Pipeline is a suite of scripts intended to streamline the processing of Next-Generation Sequencing (NGS) data. The scripts can be run individually or as a whole to form a complete pipeline.
4+
5+
## Prerequisites
6+
7+
This application is designed to be run on a Linux/WSL environment and requires the following Python libraries:
8+
9+
- psutil
10+
- tqdm
11+
- pandas
12+
- pysam
13+
- biopython
14+
15+
The application also relies on the following tools:
16+
17+
- NGSpeciesID
18+
- bwa
19+
- samtools
20+
- bcftools
21+
- whatshap
22+
- medaka
23+
- openblas
24+
- spoa
25+
26+
Note: The application checks for the required Python libraries and tools during execution and attempts to install any missing dependencies.
27+
28+
## Running the Pipeline
29+
30+
Each module of the pipeline can be run individually or as a whole.
31+
32+
### Running the Whole Pipeline
33+
34+
To run the whole pipeline, use the `fundis_main.py` script. For example:
35+
36+
```
37+
python /path/to/fundis_main.py -i /path/to/input.fastq -t /path/to/primers.txt -p 80
38+
```
39+
40+
### Running Individual Modules
41+
42+
Each module can also be run individually. Here's what each module does:
43+
44+
- **fundis_minibar_ngsid.py**: This script processes the input FASTQ file with MiniBar and NGSpeciesID. MiniBar is a tool for demultiplexing barcoded read data and NGSpeciesID is a tool used for the identification of specimens in NGS datasets. The script starts by checking the operating system, installing missing libraries, and setting up the working environment. It then moves on to demultiplexing and identifying species from the input FASTQ data. The results are output in a directory specified by the user.
45+
46+
```
47+
python /path/to/fundis_minibar_ngsid.py -i /path/to/input.fastq -t /path/to/primers.txt -p 80
48+
```
49+
50+
- **fundis_haplotype_phaser.py**: This script takes the output from the `fundis_minibar_ngsid.py` script and phases the haplotypes for each sample. Phasing is the process of determining the specific set of variants found on each physical copy of a particular gene or genomic region. The phased haplotypes are output in the NGSpeciesID output directory.
51+
52+
```
53+
python /path/to/fundis_haplotype_phaser.py -i /path/to/input_directory -p 80
54+
```
55+
56+
- **fundis_summarize2.py**: This script summarizes the output from the `fundis_haplotype_phaser.py` script. It provides a summary of the results, including counts of unique samples, total consensus sequences, and total reads in consensus sequences. It also copies and updates the names of all FASTQ and consensus FASTA files. The results are output in a summary directory named after the source directory.
57+
58+
```
59+
python /path/to/fundis_summarize2.py -s /path/to/source_folder -p 80
60+
```
61+
62+
## Arguments
63+
64+
- `-i`, `--input_fastq`: Path to the FASTQ file containing ONT nrITS data.
65+
- `-t`, `--primers_text_path`: Path to Text file containing the Primers used to generate input_fastq.
66+
- `-p`, `--percent_system_use`: Percent system use written as integer.
67+
- `-s`, `--source_folder`: Path to the source folder.
68+
69+
## Author
70+
71+
Ian Michael Bollinger (ian.michael.bollinger@gmail.com)

fundis_haplotype_phaser.py

Lines changed: 230 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,230 @@
1+
# -*- coding: utf-8 -*-
2+
"""
3+
Created on Wed Jul 19 2023
4+
5+
@author: ian.michael.bollinger@gmail.com with the support of ChatGPT 4.0
6+
7+
This is a module intended to be used as a part of a pipeline.
8+
9+
This can be used individually by calling the command:
10+
python /path/to/fundis_haplotype_phaser.py -i /path/to/input_dir -p 80
11+
python /path/to/fundis_haplotype_phaser.py --input_dir /path/to/input_dir --percent_system_use 80
12+
"""
13+
14+
import os
15+
import glob
16+
import subprocess
17+
import pandas as pd
18+
import multiprocessing
19+
import math
20+
import queue
21+
import pysam
22+
import argparse
23+
from Bio import SeqIO
24+
from Bio.Seq import Seq
25+
from Bio.SeqRecord import SeqRecord
26+
27+
# Function to determine which Operating System the code is being executed in
28+
def check_os():
29+
global environment_dir
30+
global environment_cmd_prefix
31+
# Determine the operating system in use
32+
# os.name will return 'posix', 'nt', or 'java'
33+
os_name = os.name
34+
# platform.system() will return 'Linux', 'Windows', 'Java', etc.
35+
platform_system = platform.system()
36+
37+
# If the operating system is Windows (identified by 'nt' from os.name or 'Windows' from platform.system())
38+
if os_name == 'nt' or platform_system == 'Windows':
39+
# Set the working directory to "E:" (or whatever drive letter is appropriate for your Windows system)
40+
environment_dir = "E:"
41+
# For running Linux commands in Windows Subsystem for Linux (WSL), prefix the command with "wsl "
42+
environment_cmd_prefix = "wsl "
43+
44+
# If the operating system is Linux (identified by 'posix' from os.name or 'Linux' from platform.system())
45+
elif os_name == 'posix' or platform_system == 'Linux':
46+
# Set the working directory to "/mnt/e" (or whatever the corresponding path is in your Linux system)
47+
environment_dir = "/mnt/e"
48+
49+
else:
50+
# If the operating system is neither Windows nor Linux, raise an Exception
51+
raise Exception("ERROR: OS NOT TESTED WITH THIS CODE.")
52+
53+
# Print out the detected operating system and the determined environment directory
54+
print(f'Operating System: {platform_system}')
55+
print(f'Environment Directory: {environment_dir}')
56+
return environment_dir
57+
58+
# Function to extract sequence id and supporting reads count from a fasta file
59+
def extract_data_from_fasta(fasta_file):
60+
data = []
61+
for seq_record in SeqIO.parse(fasta_file, "fasta"):
62+
sequence_id = seq_record.id
63+
supporting_reads_count = int(sequence_id.split('_supporting_reads_')[1])
64+
data.append((sequence_id, supporting_reads_count))
65+
return data
66+
67+
# Main function to process a folder
68+
def analyze_ngspeciesid_folder(ngspeciesid_dir):
69+
# Find all fasta files in the input directory
70+
fasta_files = glob.glob(os.path.join(ngspeciesid_dir, "*.fasta"))
71+
72+
# If no fasta files, return
73+
if not fasta_files:
74+
return
75+
76+
# # Extract data from all fasta files
77+
# data = []
78+
# for fasta_file in fasta_files:
79+
# data.extend(extract_data_from_fasta(fasta_file))
80+
81+
# # Create a dataframe from the extracted data
82+
# df = pd.DataFrame(data, columns=['sequence_id', 'reads_count'])
83+
84+
# # Find the index of the sequence with the maximum reads
85+
# index_max_reads = df['reads_count'].idxmax()
86+
87+
# # Get the input fasta file
88+
# input_fasta = fasta_files[index_max_reads]
89+
90+
# Run through each file
91+
for input_fastq in fasta_files:
92+
if os.path.isdir(base_folder):
93+
print(f"PASS: Skipping haplotype split, {base_folder} already exists\n")
94+
else:
95+
# # Define the base folder
96+
base_folder = input_fasta.rsplit('.', 1)[0]
97+
os.makedirs(base_folder, exist_ok=True) # add exist_ok=True
98+
os.chdir(base_folder)
99+
100+
input_fastq = input_fasta.replace('consensus_reference','reads_to_consensus').replace('.fasta','.fastq')
101+
102+
fasta_basename = os.path.basename(input_fasta)
103+
104+
sam_file = os.path.join(base_folder, fasta_basename.replace('.fasta','.sam'))
105+
106+
bam_file = os.path.join(base_folder, sam_file.replace('.sam','.bam'))
107+
108+
sorted_bam = os.path.join(base_folder, bam_file.replace('.bam','_sorted.bam'))
109+
110+
bcf_file = os.path.join(base_folder, sorted_bam.replace('_sorted.bam','.bcf'))
111+
112+
pileup_output = os.path.join(base_folder, bcf_file.replace(".bcf", "_pileup.bcf"))
113+
114+
vcf_file = os.path.join(base_folder, bcf_file.replace(".bcf",'.vcf'))
115+
116+
phased_vcf = os.path.join(base_folder, vcf_file.replace('.vcf','.phased.vcf'))
117+
118+
haplotype_fasta = os.path.join(base_folder, phased_vcf.replace('.phased.vcf','_haplotypes.fasta'))
119+
120+
# Step 1: Use BWA to align your FASTQ file to your reference FASTA file
121+
# First, index the reference genome
122+
subprocess.run(["bwa", "index", input_fasta], check=True)
123+
124+
# Then, align the reads to the reference
125+
subprocess.run(["bwa", "mem", input_fasta, input_fastq, "-o", sam_file], check=True)
126+
127+
# Step 2: Convert SAM to BAM using SAMtools
128+
subprocess.run(["samtools", "view", "-S", "-b", sam_file, "-o", bam_file], check=True)
129+
130+
# Step 3: Sort the BAM file
131+
subprocess.run(["samtools", "sort", bam_file, "-o", sorted_bam], check=True)
132+
133+
# Add this line to create the index file for the sorted BAM file:
134+
subprocess.run(["samtools", "index", sorted_bam], check=True)
135+
136+
# Step 4: Call variants with bcftools to get a VCF file
137+
138+
subprocess.run(["bcftools", "mpileup", "-Ou", "-f", input_fasta, sorted_bam, "-o", pileup_output], check=True)
139+
subprocess.run(["bcftools", "call", "-mv", "-Ob", "-o", bcf_file, pileup_output], check=True)
140+
subprocess.run(["bcftools", "view", bcf_file, "-Ov", "-o", vcf_file], check=True)
141+
142+
# Step 5: Use Whatshap to find haplotypes
143+
subprocess.run(["whatshap", "phase", "-o", phased_vcf, "--reference", input_fasta, vcf_file, sorted_bam], check=True)
144+
145+
# Step 6: Parse the output VCF file and generate sequences based on the original input FASTA
146+
# First, let's load the original reference FASTA file
147+
original_seq_record = SeqIO.read(input_fasta, "fasta")
148+
original_seq = str(original_seq_record.seq)
149+
150+
# Then, let's parse the VCF file using pysam
151+
vcf = pysam.VariantFile(phased_vcf)
152+
153+
new_sequences = []
154+
155+
# Let's iterate over each variant in the VCF file
156+
for variant in vcf:
157+
# We'll create a new sequence for each allele
158+
for allele in variant.alleles:
159+
if allele != variant.ref: # We don't need to create a new sequence for the reference allele
160+
new_seq_str = original_seq[:variant.pos-1] + allele + original_seq[variant.pos:]
161+
record_id = variant.id if variant.id is not None else f"pos_{variant.pos}_allele_{allele}"
162+
new_seq_record = SeqRecord(Seq(new_seq_str), id=record_id, description="")
163+
new_sequences.append(new_seq_record)
164+
165+
# Finally, let's write the new sequences to a new FASTA file
166+
SeqIO.write(new_sequences, haplotype_fasta, "fasta")
167+
168+
# Initialize an empty list to store all haplotype sequences
169+
haplotype_sequences = []
170+
171+
# Use glob to find all haplotype fasta files
172+
haplotype_files = glob.glob(os.path.join(base_folder, '**', '*haplotypes.fasta'), recursive=True)
173+
174+
# Take Haplotype File and save individual fasta files for each haplotype that consist of a single degenerate sequence
175+
for haplotype_file in haplotype_files:
176+
for i, record in enumerate(SeqIO.parse(haplotype_file, "fasta")):
177+
# Create a new fascondata file for each sequence
178+
individual_fasta_path = os.path.join(os.path.dirname(haplotype_file), f"{record.id}_individual.fasta")
179+
with open(individual_fasta_path, "w") as individual_fasta_file:
180+
SeqIO.write(record, individual_fasta_file, "fasta")
181+
182+
# Remove files that end in '.bcf', '_sorted.vcf', '.amb', '.ann', '.bwt', '.pac', '.sa', and '.bai' from the base folder
183+
exts = ['.bcf', '_sorted.vcf', '.amb', '.ann', '.bwt', '.pac', '.sa', '.bai']
184+
185+
# Iterate over each extension
186+
for ext in exts:
187+
# Use glob to find all files with the current extension, including subfolders
188+
file_list = glob.glob(os.path.join(ngspeciesid_dir, '**', '*' + ext), recursive=True)
189+
190+
# Iterate over each file and remove it
191+
for file_path in file_list:
192+
os.remove(file_path)
193+
return True
194+
195+
# Main entry point of the script
196+
def main(args):
197+
# Define the main directory based on the argument or use a default value
198+
input_dir = args.input_dir if args.input_dir else '/mnt/e/Fundis/NGSpeciesID-20230719T211158Z-001/NGSpeciesID'
199+
percent_system_use = float(args.percent_system_use/100) if args.percent_system_use else 0.8
200+
201+
# Get the number of CPUs and calculate the number of threads
202+
num_cpus = multiprocessing.cpu_count()
203+
cpu_threads = int(math.floor(num_cpus * percent_system_use))
204+
205+
# Get a list of all folders in the main directory, excluding folders named '__Summary__' or 'Summary2'
206+
folder_list = [f.path for f in os.scandir(input_dir) if f.is_dir() and '__Summary__' not in f.path and 'Summary2' not in f.path]
207+
208+
try:
209+
# Initialize a multiprocessing pool and process all folders
210+
with multiprocessing.Pool(cpu_threads) as pool:
211+
pool.map(analyze_ngspeciesid_folder, folder_list)
212+
except queue.Empty:
213+
print("Queue is empty. All Pipeline tasks have been processed.")
214+
finally:
215+
print('PASS: Haplotypes phased for all RiC >=9 conesensus fasta for each sample in NGSPeciesID input_dir\n')
216+
217+
# If this script is the main entry point, parse the arguments and call the main function
218+
if __name__ == "__main__":
219+
# Global environment_dir
220+
environment_dir = ""
221+
environment_cmd_prefix = ""
222+
environment_dir = check_os()
223+
main_working_dir = os.getcwd()
224+
225+
# Parse user arguments
226+
parser = argparse.ArgumentParser(description="Process NGSpeciesID main folder.")
227+
parser.add_argument('-i','--input_dir', type=str, help='Path to the NGSpeciesID main directory')
228+
parser.add_argument('-p','--percent_system_use', type=str, help='Percent system use written as integer.')
229+
args = parser.parse_args()
230+
main(args)

0 commit comments

Comments
 (0)