
Commit 15d471e

modified: README.md
modified: TIDDIT.py
modified: src/DBSCAN.py
modified: src/TIDDIT.cpp
modified: src/TIDDIT_calling.py
modified: src/TIDDIT_filtering.py
modified: src/common.h
modified: src/data_structures/CoverageModule.cpp
modified: src/data_structures/ProgramModules.h
modified: src/data_structures/Translocation.cpp
modified: src/data_structures/findTranslocationsOnTheFly.cpp
Parent: b817a7a

12 files changed: +204 additions, −94 deletions

CMakeLists.txt

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ include_directories("${PROJECT_SOURCE_DIR}/lib/bamtools/src")
 file(GLOB TIDDIT_FILES
 	${PROJECT_SOURCE_DIR}/src/TIDDIT.cpp
 	${PROJECT_SOURCE_DIR}/src/data_structures/Translocation.cpp
-	${PROJECT_SOURCE_DIR}/src/data_structures/findTranslocationsOnTheFly.cpp
+	${PROJECT_SOURCE_DIR}/src/data_structures/findTranslocationsOnTheFly.cpp
 	${PROJECT_SOURCE_DIR}/src/common.h
 	${PROJECT_SOURCE_DIR}/src/data_structures/CoverageModule.cpp
 )

README.md

Lines changed: 40 additions & 6 deletions

@@ -2,12 +2,13 @@ DESCRIPTION
 ==============
 TIDDIT is a tool used to identify chromosomal rearrangements using mate-pair or paired-end sequencing data. TIDDIT identifies intra- and inter-chromosomal translocations, deletions, tandem duplications and inversions, using supplementary alignments as well as discordant pairs.
 
-TIDDIT has two modes of analysing bam files. The sv mode, which is used to search for structural variants. And the cov mode that analyse the read depth of a bam file and generates a coverage report.
+TIDDIT has two analysis modules: the sv mode, which searches for structural variants, and the cov mode, which analyses the read depth of a bam file and generates a coverage report.
 
 
 INSTALLATION
 ==============
 TIDDIT requires standard c++/c libraries, python 2.7 or 3.6, cython, and Numpy. To compile TIDDIT, cmake must be installed.
+samtools is required for reading cram files (but not for reading bam).
 
 ```
 git clone https://github.com/SciLifeLab/TIDDIT.git
@@ -59,15 +60,26 @@ The SV module
 =============
 The main TIDDIT module detects structural variants using discordant pairs, split reads and coverage information
 
-    python TIDDIT.py --sv [Options] --bam bam
+    python TIDDIT.py --sv [Options] --bam in.bam
 
-Optionally, TIDDIT acccepts a reference fasta for GC cocrrection:
+
+TIDDIT supports streaming of the bam file:
+
+    samtools view -buh in.bam | python TIDDIT.py --sv [Options] --bam /dev/stdin
+
+Optionally, TIDDIT accepts a reference fasta for GC correction:
 
     python TIDDIT.py --sv [Options] --bam bam --ref reference.fasta
 
 
+A reference is required for analysing cram files:
+
+    python TIDDIT.py --sv [Options] --bam in.cram --ref reference.fasta
 
-Where bam is the input bam file. And reference.fasta is the reference fasta used to align the sequencing data: TIDDIT will crash if the reference fasta is different from the one used to align the reads. The reads of the input bam file must be sorted on genome position, and the bam file needs to be indexed.
+
+Where bam is the input bam or cram file, and reference.fasta is the reference fasta used to align the sequencing data. TIDDIT will crash if the reference fasta differs from the one used to align the reads. The reads of the input bam file must be sorted on genome position.
+
+The reference is required for analysing cram files.
 
 NOTE: It is important that you use the TIDDIT.py wrapper for SV detection. The TIDDIT binary in the TIDDIT/bin folder does not perform any clustering; it simply extracts SV signatures into a tab file.
 
@@ -100,8 +112,10 @@ TIDDIT SV module produces three output files, a vcf file containing SV calls, a
 
 Useful settings:
 
+
 In noisy datasets you may get too many small variants. If this is the case, you may increase the -l parameter, or set the -i parameter to a high value (such as 2000); on 10X linked-read data, -l is usually set to 5.
-
+
+
 The cov module
 ==============
 Computes the coverage of different regions of the bam file
@@ -114,13 +128,14 @@ optional parameters:
 -z - compute the coverage within bins of a specified size across the entire genome, default bin size is 500
 -u - do not print per bin quality values
 -w - generate a wig file instead of bed
+--ref - reference sequence (fasta), required for reading cram files.
 
 Filters
 =============
 TIDDIT uses four different filters to detect low quality calls. The filter field of variants passing these tests is set to "PASS". If a variant fails any of these tests, the filter field is set to the name of the failing filter. These are the four filters employed by TIDDIT:
 
 Expectedlinks
-Less than 20% of the spanning pairs/reads support the variant
+Less than <p_ratio> of the spanning pairs or <r_ratio> of the spanning reads support the variant
 FewLinks
 The number of discordant pairs supporting the variant is too low compared to the number of discordant pairs within that genomic region.
 Unexpectedcoverage
@@ -178,6 +193,25 @@ svdb --merge --vcf file1.vcf file2.vcf --bnd_distance 500 --overlap 0.6 > merged
 
 Merging of vcf files could be useful for tumor-normal analysis or for analysing a pedigree, but also to combine the output of multiple callers.
 
+Tumor-normal example
+===================
+
+Run the tumor sample using a lower ratio threshold (to allow for subclonal events, and to account for low purity):
+
+    python TIDDIT.py --sv --p_ratio 0.10 --bam tumor.bam -o tumor --ref reference.fasta
+    grep -E "#|PASS" tumor.vcf > tumor.pass.vcf
+
+Run the normal sample:
+
+    python TIDDIT.py --sv --bam normal.bam -o normal --ref reference.fasta
+    grep -E "#|PASS" normal.vcf > normal.pass.vcf
+
+Merge the files:
+
+    svdb --merge --vcf tumor.pass.vcf normal.pass.vcf --bnd_distance 500 --overlap 0.6 > Tumor_normal.vcf
+
+The output vcf should be filtered further and annotated (using a local frequency database, for instance).
+
 Annotation
 ==========
 Genes may be annotated using vep or snpeff. NIRVANA may be used for annotating CNVs, and SVDB may be used as a frequency database

TIDDIT.py

Lines changed: 38 additions & 10 deletions

@@ -8,7 +8,7 @@
 sys.path.insert(0, '{}/src/'.format(wd))
 import TIDDIT_calling
 
-version = "2.10.0"
+version = "2.11.0"
 parser = argparse.ArgumentParser("""TIDDIT-{}""".format(version),add_help=False)
 parser.add_argument('--sv' , help="call structural variation", required=False, action="store_true")
 parser.add_argument('--cov' , help="generate a coverage bed file", required=False, action="store_true")
@@ -27,13 +27,16 @@
 parser.add_argument('-Q', type=int,default=20, help="Minimum regional mapping quality (default 20)")
 parser.add_argument('-n', type=int,default=2, help="the ploidy of the organism,(default = 2)")
 parser.add_argument('-e', type=int, help="clustering distance parameter, discordant pairs closer than this distance are considered to belong to the same variant(default = sqrt(insert-size*2)*12)")
-parser.add_argument('-l', type=int,default=3, help="min-pts parameter (default=3),must be set > 2")
-parser.add_argument('-s', type=int,default=20000000, help="Number of reads to sample when computing library statistics(default=50000000)")
+parser.add_argument('-l', type=int,default=3, help="min-pts parameter (default=3),must be set >= 2")
+parser.add_argument('-s', type=int,default=25000000, help="Number of reads to sample when computing library statistics(default=25000000)")
 parser.add_argument('-z', type=int,default=100, help="minimum variant size (default=100), variants smaller than this will not be printed ( z < 10 is not recomended)")
 parser.add_argument('--force_ploidy',action="store_true", help="force the ploidy to be set to -n across the entire genome (i.e skip coverage normalisation of chromosomes)")
+parser.add_argument('--no_cluster',action="store_true", help="Run only the TIDDIT signal extraction")
 parser.add_argument('--debug',action="store_true", help="rerun the tiddit clustering procedure")
 parser.add_argument('--n_mask',type=float,default=0.5, help="exclude regions from coverage calculation if they contain more than this fraction of N (default = 0.5)")
-parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction")
+parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction and for reading cram")
+parser.add_argument('--p_ratio', type=float,default=0.2, help="minimum discordant pair/normal pair ratio at the breakpoint junction(default=20%)")
+parser.add_argument('--r_ratio', type=float,default=0.1, help="minimum split read/coverage ratio at the breakpoint junction(default=10%)")
 
 args= parser.parse_args()
 args.wd=os.path.dirname(os.path.realpath(__file__))
@@ -44,19 +47,29 @@
 		if not os.path.isfile(args.ref):
 			print ("error, could not find the reference file")
 			quit()
-	if not args.bam.endswith(".bam"):
-		print ("error, the input file is not a bam file, make sure that the file extension is .bam")
+
+	if not args.ref and args.bam.endswith(".cram"):
+		print("error, reference fasta is required when using cram input")
 		quit()
 
-	if not os.path.isfile(args.bam):
+	if not (args.bam.endswith(".bam") or args.bam.endswith(".cram")) and not "/dev/" in args.bam:
+		print ("error, the input file is not a bam file, make sure that the file extension is .bam or .cram")
+		quit()
+
+	if not os.path.isfile(args.bam) and not "/dev/" in args.bam:
 		print ("error, could not find the bam file")
 		quit()
 
 	if not os.path.isfile("{}/bin/TIDDIT".format(args.wd)):
 		print ("error, could not find the TIDDIT executable file, try rerun the INSTALL.sh script")
 		quit()
 
-	command_str="{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(args.wd,args.bam,args.o,args.p,args.r,args.q,args.n,args.s)
+	if not args.bam.endswith(".cram"):
+		command_str="{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(args.wd,args.bam,args.o,args.p,args.r,args.q,args.n,args.s)
+	else:
+		command_str="samtools view -hbu {} -T {} | {}/bin/TIDDIT --sv -b /dev/stdin -o {} -p {} -r {} -q {} -n {} -s {}".format(args.bam,args.ref,args.wd,args.o,args.p,args.r,args.q,args.n,args.s)
+
 	if args.i:
 		command_str += " -i {}".format(args.i)
 	if args.d:
@@ -74,7 +87,8 @@
 		os.system("cat {} | {}/bin/TIDDIT --gc -z 50 -o {}".format(args.ref,args.wd,args.o))
 		print ("Constructed GC wig in {} sec".format(time.time()-t))
 
-	TIDDIT_calling.cluster(args)
+	if not args.no_cluster:
+		TIDDIT_calling.cluster(args)
 
 elif args.cov:
 	parser = argparse.ArgumentParser("""TIDDIT --cov --bam inputfile [-o prefix]""")
@@ -84,9 +98,23 @@
 	parser.add_argument('-z', type=int,default=500, help="use bins of specified size(default = 500bp) to measure the coverage of the entire bam file, set output to stdout to print to stdout")
 	parser.add_argument('-w' , help="generate wig instead of bed", required=False, action="store_true")
 	parser.add_argument('-u' , help="skip per bin mapping quality", required=False, action="store_true")
+	parser.add_argument('--ref', type=str, help="reference fasta, used for GC correction and for reading cram")
 	args= parser.parse_args()
 	args.wd=os.path.dirname(os.path.realpath(__file__))
-	command="{}/bin/TIDDIT --cov -b {} -o {} -z {}".format(args.wd,args.bam,args.o,args.z)
+
+	if args.ref:
+		if not os.path.isfile(args.ref):
+			print ("error, could not find the reference file")
+			quit()
+
+	if not args.bam.endswith(".cram"):
+		command="{}/bin/TIDDIT --cov -b {} -o {} -z {}".format(args.wd,args.bam,args.o,args.z)
+	else:
+		if not args.ref:
+			print("error, missing reference sequence!")
+			quit()
+		command="samtools view -hbu {} -T {} | {}/bin/TIDDIT --cov -b /dev/stdin -o {} -z {}".format(args.bam,args.ref,args.wd,args.o,args.z)
+
 	if args.w:
 		command += " -w"
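The new bam/cram dispatch in the wrapper can be sketched as a standalone function (the name `build_sv_command` and its flat argument list are illustrative, not part of TIDDIT): CRAM input is decompressed on the fly with samtools and piped to the TIDDIT binary on /dev/stdin, while BAM is passed directly.

```python
def build_sv_command(wd, bam, ref, out, p, r, q, n, s):
    # BAM (or a /dev/ stream) is read directly by the TIDDIT binary;
    # CRAM is converted to uncompressed BAM on the fly with samtools,
    # which is why samtools and --ref are required for cram input.
    if not bam.endswith(".cram"):
        return "{}/bin/TIDDIT --sv -b {} -o {} -p {} -r {} -q {} -n {} -s {}".format(
            wd, bam, out, p, r, q, n, s)
    return ("samtools view -hbu {} -T {} | "
            "{}/bin/TIDDIT --sv -b /dev/stdin -o {} -p {} -r {} -q {} -n {} -s {}".format(
                bam, ref, wd, out, p, r, q, n, s))
```

Keeping the branch in one place like this makes it clear that the downstream options (-i, -d, and so on) are appended identically in both cases.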

src/DBSCAN.py

Lines changed: 5 additions & 1 deletion

@@ -98,7 +98,11 @@ def analyse_pos(candidate_signals,discordants,library_stats,args):
 def generate_clusters(chrA,chrB,coordinates,library_stats,args):
 	candidates=[]
 	coordinates=coordinates[numpy.lexsort((coordinates[:,1],coordinates[:,0]))]
-	db=main(coordinates[:,0:2],args.e,int(round(args.l+library_stats["ploidies"][chrA]/(args.n*10))))
+	min_pts=args.l
+	if chrA == chrB and library_stats["ploidies"][chrA] > args.n*2:
+		min_pts=int(round(args.l/float(args.n)*library_stats["ploidies"][chrA]))
+
+	db=main(coordinates[:,0:2],args.e,min_pts)
	unique_labels = set(db)
 
 	for var in unique_labels:
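The effect of the new min-pts rule can be illustrated in isolation (the function and argument names below are illustrative, not TIDDIT's API): on a contig whose estimated ploidy exceeds twice the expected ploidy `-n`, the DBSCAN min-pts threshold is scaled proportionally, so high-copy regions need correspondingly more discordant pairs per cluster.

```python
def scaled_min_pts(l, n, ploidy, intra_chromosomal=True):
    # l: the -l min-pts parameter; n: the expected ploidy (-n);
    # ploidy: estimated ploidy of chrA. As in the patched
    # generate_clusters, scaling only applies to intra-chromosomal
    # clusters on contigs with ploidy above 2*n.
    min_pts = l
    if intra_chromosomal and ploidy > n * 2:
        min_pts = int(round(l / float(n) * ploidy))
    return min_pts
```

With the defaults (-l 3, -n 2), a hexaploid contig would therefore require 9 supporting pairs instead of 3.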

src/TIDDIT.cpp

Lines changed: 27 additions & 12 deletions

@@ -14,6 +14,7 @@
 #include <sstream>
 #include <iostream>
 #include <fstream>
+#include "api/BamWriter.h"
 
 #include "data_structures/ProgramModules.h"
 //converts a string to int
@@ -45,7 +46,7 @@ int main(int argc, char **argv) {
 	int min_variant_size= 100;
 	int sample = 100000000;
 	string outputFileHeader ="output";
-	string version = "2.10.0";
+	string version = "2.11.0";
 
 	//collect all options as a vector
 	vector<string> arguments(argv, argv + argc);
@@ -358,8 +359,6 @@ int main(int argc, char **argv) {
 		genomeLength += StringToNumber(sequence->Length);
 		contigsNumber++;
 	}
-	bamFile.Close();
-
 
 	//if the find structural variations module is chosen collect the options
 	if(vm["--sv"] == "found"){
@@ -397,13 +396,26 @@ int main(int argc, char **argv) {
 			ploidy = convert_str( vm["-n"],"-n" );
 		}
 
+		BamWriter sampleBam;
+		sampleBam.SetCompressionMode(BamWriter::Uncompressed);
+		string sampleBamName=outputFileHeader+".sample.bam";
+
+		sampleBam.Open(sampleBamName.c_str(), head, bamFile.GetReferenceData());
+		//sample the first reads
+
+		BamAlignment sampleRead;
+		for (int i=0;i< sample;i++){
+			bamFile.GetNextAlignment(sampleRead);
+			sampleBam.SaveAlignment(sampleRead);
+		}
+		sampleBam.Close();
+
 		//now compute library stats
 		LibraryStatistics library;
 		size_t start = time(NULL);
-		library = computeLibraryStats(alignmentFile, genomeLength, max_insert, 50 , outtie,minimum_mapping_quality,outputFileHeader,sample);
+		library = computeLibraryStats(genomeLength, max_insert, 50 , outtie,minimum_mapping_quality,outputFileHeader,sample);
 		printf ("library stats time consumption= %lds\n", time(NULL) - start);
 
-
 		coverage = library.C_A;
 		if(vm["-c"] != ""){
 			coverage = convert_str( vm["-c"],"-c" );
@@ -442,10 +454,10 @@ int main(int argc, char **argv) {
 		SV_options["STDInsert"] = insertStd;
 		SV_options["min_variant_size"] = min_variant_size;
 		SV_options["splits"] = minimumSupportingReads;
-
+
 		StructuralVariations *FindTranslocations;
 		FindTranslocations = new StructuralVariations();
-		FindTranslocations -> findTranslocationsOnTheFly(alignmentFile, outtie, coverage,outputFileHeader, version, "TIDDIT" + argString,SV_options, genomeLength);
+		FindTranslocations -> findTranslocationsOnTheFly(alignmentFile,bamFile, outtie, coverage,outputFileHeader, version, "TIDDIT" + argString,SV_options, genomeLength);
 
 	//the coverage module
 	}else if(vm["--cov"] == "found"){
@@ -474,16 +486,19 @@ int main(int argc, char **argv) {
 			span = true;
 		}
 
-		calculateCoverage = new Cov(binSize,alignmentFile,outputFileHeader,0,wig,skipQual,span);
-		BamReader bam;
-		bam.Open(alignmentFile);
+		//BamReader bam;
+		//bam.Open(alignmentFile);
+		SamHeader head = bamFile.GetHeader();
+		calculateCoverage = new Cov(binSize,head,outputFileHeader,0,wig,skipQual,span);
+
 		BamAlignment currentRead;
-		while ( bam.GetNextAlignmentCore(currentRead) ) {
+		while ( bamFile.GetNextAlignmentCore(currentRead) ) {
 			readStatus alignmentStatus = computeReadType(currentRead, 100000,100, true);
 			calculateCoverage -> bin(currentRead, alignmentStatus);
 		}
-		bam.Close();
 		calculateCoverage -> printCoverage();
+		return(0);
+
 	}
 
 }
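The C++ change above keeps the already-open bamFile reader alive instead of re-opening the file, and copies the first -s alignments into <prefix>.sample.bam for library-statistics estimation, which is what makes streamed (non-seekable) input work. The sampling idea, sketched in Python over a generic read iterator (names are illustrative, not TIDDIT code):

```python
def take_first_reads(reads, sample):
    # Consume at most `sample` alignments from a (possibly streamed,
    # single-pass) iterator; the C++ version writes them to a temporary
    # uncompressed bam so library stats can be computed without seeking.
    sampled = []
    for read in reads:
        if len(sampled) >= sample:
            break
        sampled.append(read)
    return sampled
```

Taking the first reads (rather than random positions) is what allows the same open stream to continue into variant calling afterwards.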

src/TIDDIT_calling.py

Lines changed: 7 additions & 1 deletion

@@ -140,7 +140,13 @@ def generate_vcf_line(chrA,chrB,n,candidate,args,library_stats,percentiles,span_
 	vcf_line.append(FORMAT_FORMAT)
 	CN="."
 	if "DEL" in var or "DUP" in var:
-		CN=int(round(candidate["covM"]/(library_stats["Coverage"]/args.n)))
+		if library_stats["Coverage"]:
+			CN=int(round(candidate["covM"]/(library_stats["Coverage"]/args.n)))
+		elif library_stats["chr_cov"][chrA]:
+			CN=int(round(candidate["covM"]/(library_stats["chr_cov"][chrA]/library_stats["ploidies"][chrA])))
+		else:
+			CN=0
+
 	if "DEL" in var:
 		CN=library_stats["ploidies"][chrA]-CN
 		if CN < 0:
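The fallback chain for copy-number estimation can be shown as a standalone sketch (function and argument names are illustrative): genome-wide coverage is preferred, per-chromosome coverage is the fallback, and CN defaults to 0 when both are zero, avoiding the division-by-zero the old one-liner could hit.

```python
def estimate_copy_number(cov_m, genome_cov, n, chr_cov, chr_ploidy):
    # cov_m: coverage across the variant; genome_cov: library-wide
    # coverage; n: expected ploidy (-n); chr_cov/chr_ploidy:
    # per-chromosome fallbacks from library_stats.
    if genome_cov:
        return int(round(cov_m / (genome_cov / n)))
    elif chr_cov:
        return int(round(cov_m / (chr_cov / chr_ploidy)))
    return 0
```

For deletions the caller then subtracts this estimate from the contig's ploidy, as in the hunk above.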

src/TIDDIT_filtering.py

Lines changed: 14 additions & 3 deletions

@@ -5,15 +5,26 @@ def fetch_filter(chrA,chrB,candidate,args,library_stats):
 	filt="PASS"
 
 	#Less than the expected number of signals
-	ratio_scaling=min([ candidate["covA"]/library_stats["chr_cov"][chrA], candidate["covB"]/library_stats["chr_cov"][chrB] ])
+	r_a=0
+	if library_stats["chr_cov"][chrA]:
+		r_a=candidate["covA"]/library_stats["chr_cov"][chrA]
+	r_b=0
+	if library_stats["chr_cov"][chrB]:
+		r_b=candidate["covB"]/library_stats["chr_cov"][chrB]
+
+
+	ratio_scaling=min([ r_b, r_a ])
+
 	if ratio_scaling > 2:
 		ratio_scaling=2
 	elif ratio_scaling < 1:
 		ratio_scaling=1
 
-	if candidate["ratio"]*ratio_scaling <= 0.2 and candidate["discs"] > candidate["splits"]:
+	if chrA == chrB and library_stats["ploidies"][chrA] > 10 and candidate["ratio"] <= 0.05:
+		filt = "BelowExpectedLinks"
+	elif candidate["ratio"]*ratio_scaling <= args.p_ratio and candidate["discs"] > candidate["splits"]:
 		filt = "BelowExpectedLinks"
-	elif candidate["ratio"]*ratio_scaling <= 0.1 and candidate["discs"] < candidate["splits"]:
+	elif candidate["ratio"]*ratio_scaling <= args.r_ratio and candidate["discs"] < candidate["splits"]:
 		filt = "BelowExpectedLinks"
 
 	#The ploidy of this contig is 0, hence there should be no variant here
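The guarded ratio computation can be exercised on its own (the function name below is illustrative): chromosomes with zero estimated coverage contribute a ratio of 0 instead of raising a division error, and the combined scaling factor is clamped to the range [1, 2] before being applied to the new p_ratio/r_ratio thresholds.

```python
def clamped_ratio_scaling(cov_a, cov_b, chr_cov_a, chr_cov_b):
    # Guard against zero per-chromosome coverage (the crash this
    # patch avoids), then clamp as fetch_filter does.
    r_a = cov_a / chr_cov_a if chr_cov_a else 0
    r_b = cov_b / chr_cov_b if chr_cov_b else 0
    scaling = min(r_a, r_b)
    if scaling > 2:
        scaling = 2
    elif scaling < 1:
        scaling = 1
    return scaling
```

A zero-coverage chromosome thus yields the minimum scaling of 1, so the variant is still judged against the unscaled p_ratio/r_ratio thresholds rather than crashing the run.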
