Skip to content

Advanced imputation metrics #44

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Nov 13, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions data/test/merge_gvcfs.bed
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
chr3 611489 616181
chr3 617969 619167
chr7 1047602 1169685
71 changes: 71 additions & 0 deletions data/test/merge_gvcfs_subset/merge_gvcfs-chr3_611490-616181.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,71 @@
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=3,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AF,Number=3,Type=Integer,Description="Allele Frequency">
##INFO=<ID=ASM_Chr,Number=1,Type=String,Description="Assembly chromosome">
##INFO=<ID=ASM_End,Number=1,Type=Integer,Description="Assembly end position">
##INFO=<ID=ASM_Start,Number=1,Type=Integer,Description="Assembly start position">
##INFO=<ID=ASM_Strand,Number=1,Type=String,Description="Assembly strand">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
chr3 611512 . C G . . . GT 1 0 0
chr3 611768 . G A . . . GT 1 0 0
chr3 612208 . C T . . . GT 0 1 1
chr3 612646 . A C . . . GT 1 0 0
chr3 612705 . G A . . . GT 1 0 0
chr3 612766 . C G . . . GT 1 1 1
chr3 613041 . C G,<DEL> . . . GT 1 2 2
chr3 613045 . G A . . . GT 1 0 0
chr3 613046 . G A . . . GT 1 0 0
chr3 613154 . C T . . . GT 0 1 1
chr3 613697 . G T . . . GT 1 0 0
chr3 614008 . C T . . . GT 1 1 1
chr3 614010 . A G . . . GT 1 1 1
chr3 614032 . T A . . . GT 1 0 0
chr3 614060 . G C . . . GT 0 1 1
chr3 614068 . A C . . . GT 0 1 1
chr3 614098 . T C . . . GT 0 1 1
chr3 614121 . A G . . . GT 0 1 1
chr3 614172 . A G . . . GT 0 1 1
chr3 614173 . G A . . . GT 0 1 1
chr3 614189 . G A . . . GT 0 1 1
chr3 614190 . G A . . . GT 1 1 1
chr3 614214 . A G . . . GT 0 1 1
chr3 614262 . T G . . . GT 0 1 1
chr3 614287 . C T . . . GT 0 1 1
chr3 614363 . T C . . . GT 1 1 1
chr3 614376 . C T . . . GT 0 1 1
chr3 614405 . A T . . . GT 0 1 1
chr3 614409 . T A . . . GT 0 1 1
chr3 614437 . A C . . . GT 0 1 1
chr3 614444 . A C . . . GT 0 1 1
chr3 614569 . T A . . . GT 1 1 1
chr3 614730 . T G . . . GT 1 1 1
chr3 614741 . A C . . . GT 0 1 1
chr3 614793 . G T . . . GT 0 1 1
chr3 614824 . A T . . . GT 0 1 1
chr3 614834 . G T . . . GT 0 1 1
chr3 615048 . C T . . . GT 0 0 1
chr3 615051 . T A . . . GT 0 1 1
chr3 615173 . A T . . . GT 1 1 1
chr3 615286 . A G . . . GT 1 0 0
chr3 615344 . A T . . . GT 1 1 1
chr3 615366 . T C,<DEL> . . . GT 2 1 1
chr3 615398 . G A . . . GT 0 1 1
chr3 615448 . C T . . . GT 1 1 1
chr3 615459 . G C . . . GT 1 1 1
chr3 615461 . G C . . . GT 0 1 1
chr3 615465 . G A . . . GT 0 1 1
chr3 615481 . G A . . . GT 1 0 0
chr3 615493 . T C . . . GT 1 0 0
chr3 615494 . G T . . . GT 1 0 0
chr3 615747 . G A . . . GT 1 1 1
chr3 615823 . T C . . . GT 1 1 1
chr3 615865 . A T . . . GT 1 1 1
chr3 615968 . A G . . . GT 1 1 1
chr3 616102 . T C . . . GT 1 0 0
73 changes: 73 additions & 0 deletions data/test/merge_gvcfs_subset/merge_gvcfs-chr3_617970-619167.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,73 @@
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=3,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AF,Number=3,Type=Integer,Description="Allele Frequency">
##INFO=<ID=ASM_Chr,Number=1,Type=String,Description="Assembly chromosome">
##INFO=<ID=ASM_End,Number=1,Type=Integer,Description="Assembly end position">
##INFO=<ID=ASM_Start,Number=1,Type=Integer,Description="Assembly start position">
##INFO=<ID=ASM_Strand,Number=1,Type=String,Description="Assembly strand">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
chr3 618018 . A G . . . GT 1 1 1
chr3 618024 . A T . . . GT 1 1 1
chr3 618156 . T G,<DEL> . . . GT 1 2 2
chr3 618168 . G A,<DEL> . . . GT 1 2 2
chr3 618170 . T C,<DEL> . . . GT 1 2 2
chr3 618171 . A C,<DEL> . . . GT 1 2 2
chr3 618178 . A T,<DEL> . . . GT 1 2 2
chr3 618251 . T A . . . GT 0 1 1
chr3 618272 . T C . . . GT 1 0 0
chr3 618275 . A G . . . GT 0 1 1
chr3 618295 . A G . . . GT 1 1 1
chr3 618299 . T C . . . GT 1 1 1
chr3 618306 . A G . . . GT 1 1 1
chr3 618320 . A G . . . GT 0 1 1
chr3 618324 . G A . . . GT 0 1 1
chr3 618350 . G A . . . GT 0 1 1
chr3 618364 . T G . . . GT 0 1 1
chr3 618369 . C T . . . GT 1 0 0
chr3 618386 . G A . . . GT 0 1 1
chr3 618404 . G A . . . GT 1 0 0
chr3 618407 . G A . . . GT 1 0 0
chr3 618410 . G C . . . GT 1 0 0
chr3 618418 . A T . . . GT 1 0 0
chr3 618420 . G C . . . GT 0 1 1
chr3 618454 . G C . . . GT 0 1 1
chr3 618465 . G A . . . GT 1 0 0
chr3 618501 . C T . . . GT 1 1 1
chr3 618509 . T C . . . GT 1 0 0
chr3 618524 . G A . . . GT 0 1 1
chr3 618529 . C T . . . GT 0 1 1
chr3 618548 . A T . . . GT 0 1 1
chr3 618559 . C T . . . GT 1 1 1
chr3 618564 . A G . . . GT 1 1 1
chr3 618568 . C G . . . GT 1 1 1
chr3 618581 . C T . . . GT 0 1 1
chr3 618594 . G A . . . GT 1 1 1
chr3 618606 . G T . . . GT 1 0 0
chr3 618644 . A G . . . GT 1 1 1
chr3 618672 . A G . . . GT 1 0 0
chr3 618698 . G C . . . GT 0 1 1
chr3 618706 . A G . . . GT 1 0 0
chr3 618712 . A G . . . GT 1 0 0
chr3 618727 . G A . . . GT 1 0 0
chr3 618745 . A T . . . GT 0 1 1
chr3 618874 . G A . . . GT 1 1 1
chr3 618882 . T A . . . GT 0 1 1
chr3 618889 . T A . . . GT 0 1 1
chr3 618943 . G T . . . GT 0 1 1
chr3 618963 . T C . . . GT 1 0 0
chr3 618987 . C A,T . . . GT 1 2 2
chr3 618994 . G A . . . GT 1 1 1
chr3 619031 . C T . . . GT 1 1 1
chr3 619044 . C T . . . GT 0 1 1
chr3 619052 . C A . . . GT 0 1 1
chr3 619053 . C T . . . GT 0 1 1
chr3 619054 . G A . . . GT 0 1 1
chr3 619102 . G T,<DEL> . . . GT 2 1 1
chr3 619166 . A G . . . GT 1 1 1
32 changes: 32 additions & 0 deletions data/test/merge_gvcfs_subset/merge_gvcfs-chr7_1047603-1169685.vcf
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
##fileformat=VCFv4.2
##FORMAT=<ID=AD,Number=3,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth (only filtered reads used for calling)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=3,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##INFO=<ID=AF,Number=3,Type=Integer,Description="Allele Frequency">
##INFO=<ID=ASM_Chr,Number=1,Type=String,Description="Assembly chromosome">
##INFO=<ID=ASM_End,Number=1,Type=Integer,Description="Assembly end position">
##INFO=<ID=ASM_Start,Number=1,Type=Integer,Description="Assembly start position">
##INFO=<ID=ASM_Strand,Number=1,Type=String,Description="Assembly strand">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3
chr7 1047637 . C T,<INS> . . . GT 1 2 1
chr7 1153180 . T G,<INS> . . . GT 2 1 0
chr7 1153182 . T C,<INS> . . . GT 0 1 2
chr7 1153183 . A C,<INS> . . . GT 0 2 1
chr7 1153292 . A T,<INS> . . . GT 1 1 2
chr7 1153343 . T G,<INS> . . . GT 2 2 1
chr7 1153353 . T A,<INS> . . . GT 2 2 1
chr7 1153393 . G T,<INS> . . . GT 2 1 2
chr7 1154153 . G T,<INS> . . . GT 2 2 1
chr7 1154345 . T G,<INS> . . . GT 1 1 2
chr7 1157847 . T C,<INS> . . . GT 1 1 2
chr7 1159061 . G A,<INS> . . . GT 2 1 1
chr7 1159063 . G T,<INS> . . . GT 1 2 2
chr7 1159274 . T A,<INS> . . . GT 0 1 2
chr7 1159330 . C A,<INS> . . . GT 2 1 1
chr7 1160237 . T G,<INS> . . . GT 1 2 2
chr7 1169685 . G A,<INS> . . . GT 1 0 2
28 changes: 28 additions & 0 deletions src/main/kotlin/biokotlin/util/BedUtils.kt
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
package biokotlin.util

import biokotlin.genome.PositionRange
import java.io.File
import java.util.*

object BedUtils {

/**
* Read a bed file and return a sorted set of PositionRange objects.
* Bed files are zero-based, inclusive / exclusive.
* PositionRange objects are one-based, inclusive / inclusive.
*/
fun readBedfile(bedFileName: String): SortedSet<PositionRange> {

require(File(bedFileName).exists()) { "File $bedFileName does not exist." }

return bufferedReader(bedFileName).readLines().map { line ->
val lineSplit = line.split("\t")
val chrom = lineSplit[0]
val start = lineSplit[1].toInt() + 1
val end = lineSplit[2].toInt()
PositionRange(chrom, start, end)
}.toSortedSet()

}

}
81 changes: 78 additions & 3 deletions src/main/kotlin/biokotlin/util/MergeGVCFUtils.kt
Original file line number Diff line number Diff line change
@@ -1,10 +1,13 @@
package biokotlin.util

import biokotlin.genome.PositionRange
import biokotlin.util.BedUtils.readBedfile
import htsjdk.variant.variantcontext.Allele
import htsjdk.variant.variantcontext.GenotypeBuilder
import htsjdk.variant.variantcontext.VariantContext
import htsjdk.variant.variantcontext.VariantContextBuilder
import htsjdk.variant.variantcontext.writer.Options
import htsjdk.variant.variantcontext.writer.VariantContextWriter
import htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder
import htsjdk.variant.vcf.VCFHeader
import kotlinx.coroutines.*
Expand All @@ -14,9 +17,15 @@ import java.io.File

private val myLogger = LogManager.getLogger("biokotlin.util.MergeGVCFUtils")

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should have a header comment with a description of the scope of functionality that is included.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I added that

/**
* Merges GVCF files into a single VCF file.
* The GVCF files should have only one sample each.
* If a bedfile is provided, then the output VCF file is split into multiple files
* based on the ranges in the bedfile.
*/
object MergeGVCFUtils {

fun mergeGVCFs(inputDir: String, outputFile: String) {
fun mergeGVCFs(inputDir: String, outputFile: String, bedfile: String? = null) {

runBlocking {

Expand Down Expand Up @@ -61,8 +70,14 @@ object MergeGVCFUtils {

}

myLogger.info("writing output: $outputFile")
writeOutputVCF(outputFile, samples, variantContextChannel)
val theBedfile = bedfile?.trim()
if (theBedfile.isNullOrEmpty()) {
myLogger.info("Writing output: $outputFile")
writeOutputVCF(outputFile, samples, variantContextChannel)
} else {
myLogger.info("Writing output files with base name: $outputFile and bedfile: $bedfile")
writeOutputVCF(outputFile, theBedfile, samples, variantContextChannel)
}

}

Expand Down Expand Up @@ -97,6 +112,9 @@ private fun createVariantContext(

}

/**
* Writes the merged VCF file. This is used when a bedfile is not provided.
*/
private suspend fun writeOutputVCF(
outputFile: String,
samples: List<String>,
Expand Down Expand Up @@ -124,6 +142,63 @@ private suspend fun writeOutputVCF(

}

/**
* Writes the merged VCF files. This is used when a bedfile is provided.
* The bedfile is used to create one VCF file per range in the bedfile.
*/
private suspend fun writeOutputVCF(
outputFile: String,
bedfile: String,
samples: List<String>,
variantContextChannel: Channel<Deferred<List<VariantContext>>>
) {

val ranges = readBedfile(bedfile)

val header = VCFHeader(createGenericVCFHeaders(samples))

var writePositionRange: PositionRange? = null
var writer: VariantContextWriter? = null
for (deferred in variantContextChannel) {

val variantContexts = deferred.await()
variantContexts.forEach { variantContext ->

val currentPositionRange = PositionRange(variantContext.contig, variantContext.start, variantContext.end)

// If the current position range doesn't overlap with the current write position range
// then close the current writer and open a new one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the VariantContext record overlaps multiple ranges? Should it get split, with parts written to range-specific files for all files it overlaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are SNPs, so the VariantContext is only one position

if (writePositionRange == null || !writePositionRange!!.overlapping(currentPositionRange)) {

// Close the previous writer
writer?.close()
writer = null

// Find the range in the bedfile that overlaps with the current position range
// and create a new writer for that range. If no range is found, then don't create
// a writer.
writePositionRange = ranges.find { it.overlapping(currentPositionRange) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

writePositionRange could be multiple ranges?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are SNPs, so only one position per variant

writePositionRange?.let {
val filename = "${outputFile}-${writePositionRange!!.toString().replace(':', '_')}.vcf"
writer = VariantContextWriterBuilder()
.unsetOption(Options.INDEX_ON_THE_FLY)
.setOutputFile(File(filename))
.setOutputFileType(VariantContextWriterBuilder.OutputType.VCF)
.setOption(Options.ALLOW_MISSING_FIELDS_IN_HEADER)
.build()
writer?.writeHeader(header)
}

}

writer?.add(variantContext)

}

}

}

/**
* Creates a VariantContext for the current position, given the variants
* at that position from the GVCF readers.
Expand Down
Loading