Update representation of BNDs to be Pysam compatible #835
Draft: kjaisingh wants to merge 44 commits into main from kj_bnd_pipeline_fixes_V2
Description
This PR refactors the VCFs passed through GATK-SV to make them compatible with newer versions of Pysam, which do not allow a variant to have an `END` position that is less than its start position. We therefore ensure that every VCF passed through GATK-SV sets the `END` field of all BND variants to `pos+1`, and sets the `END2` field to the true end position on the second chromosome. All downstream scripts that consume such VCFs are updated to interpret BNDs using this `END2` tag.

As it stands, VCFs only use the correct `END` format starting from the outputs of `13-ResolveComplexVariants`. However, that step sets the `END` field for such variants to `pos`, which later versions of Pysam no longer support and simply drop; this is why we use `pos+1` in this PR.

Testing
- `wham_vcf` has some minor differences, as Wham is a stochastic algorithm.
- `sample_metrics_files` now has the `manta_<SAMPLE>_vcf_invalid_end` metric reduced to 0.
- `std_manta_vcf_tar` now includes well-formatted `END`/`END2` tags.
- `manta_tloc` now has a different ordering of `INFO` fields.
- `std_wham_vcf_tar`) have minor differences, as expected.
- `clustered_manta_vcf` now includes well-formatted `END`/`END2` tags.
- `clustered_wham_vcf`, Wham's `clustered_sv_counts`, Wham metrics in `metrics_file_clusterbatch`) have minor differences, as expected.
- `metrics` and `metrics_common` differ slightly for Wham records, given the stochasticity of Wham.
- `metrics_file_batchmetrics` differs slightly due to the presence of variable Wham records.
- `filtered_vcf` in FilteredGenotypes: `CHR2` tag no longer exists for variants with `SVTYPE=CPX`.

Pre-Merge Changes Required
- Per the `Omitted (uses .stop but no longer used in GATK-SV)` section, archive all scripts that are no longer used in GATK-SV.
- Update the `JoinRawCalls` workflow in the featured workspace to not use `--fix-end`.
- `src/svtk/svtk/utils/__init__.py` and the automated syncing of WDLs to Dockstore.

Change Log
Included (uses .stop):
- `src/svtk/svtk/standardize/standardize.py`: Called in GatherSampleEvidence and GatherBatchEvidence - requires updating as the call to in TinyResolve leverages Pysam on the standardized VCFs.
- `src/svtk/svtk/standardize/std_manta.py`: Called in GatherSampleEvidence and GatherBatchEvidence - requires updating as the call to `svtk standardize` in TinyResolve leverages Pysam on the standardized VCFs.
- `src/svtk/svtk/standardize/std_dragen.py`: Called in GatherBatchEvidence - requires updating as the call to `svtk standardize` in StandardizeVCFs leverages Pysam to build standardized VCFs.
- `src/svtk/svtk/cli/resolve.py`: Called in GatherBatchEvidence by TinyResolve.
- `src/sv-pipeline/scripts/format_svtk_vcf_for_gatk.py`: Called in ClusterBatch, CombineBatches, CleanVcf and JoinRawCalls - requires updating to set the `END` tag to `record.pos` instead of `record.start+1`.
- `src/sv-pipeline/scripts/format_gatk_vcf_for_svtk.py`: Called in ClusterBatch and CombineBatches - requires updating to set the `END` and `END2` tags appropriately.
- `src/svtk/svtk/cli/pesr_test.py`: Called in GenerateBatchMetrics.
- `src/svtk/svtk/pesr/pe_test.py`: Called in GenerateBatchMetrics.
- `src/svtk/svtk/pesr/sr_test.py`: Called in GenerateBatchMetrics.
- `src/svtk/svtk/pesr/breakpoint.py`: Called in GenerateBatchMetrics.
- `src/svtk/svtk/utils/utils.py`: Called via `vcf2bed` in GenerateBatchMetrics, CleanVcf, ResolveComplexVariants and more.
- `src/sv-pipeline/02_evidence_assessment/02e_metric_aggregation/scripts/aggregate.py`: Called in GenerateBatchMetrics.
- `src/sv-pipeline/scripts/annotate_bnd_coords.py`: Called in `FilterBatchSites` - new script that adds the `END2` field to variants reclassified by the random forest model to be a BND from other variant types.
- `src/sv-pipeline/03_variant_filtering/scripts/rewrite_SR_coords.py`: Called in FilterBatchSites.
- `src/sv-pipeline/04_variant_resolution/scripts/postCPX_cleanup.py`: Called in ResolveComplexVariants - requires updating to set the `END` tag to `record.pos` instead of `record.start+1`, and to not overwrite the already-corrected `END2` tag.
- `src/sv-pipeline/04_variant_resolution/scripts/overlap_breakpoint_filter.py`: Called in ResolveComplexVariants.
- `src/svtk/svtk/svfile.py`: Called in ResolveComplexVariants.
- `src/svtk/svtk/cxsv/complex_sv.py`: Called in ResolveComplexVariants.
- `src/svtk/svtk/cxsv/cpx_inv.py`: Called in ResolveComplexVariants.
- `src/svtk/svtk/cxsv/cpx_link.py`: Called in ResolveComplexVariants.
- `src/svtk/svtk/cxsv/cpx_tloc.py`: Called in ResolveComplexVariants.
- `src/svtk/svtk/cxsv/rescan_single_enders.py`: Called in ResolveComplexVariants.

Included (doesn't use .stop):
- `src/svtk/svtk/cli/standardize_vcf.py`: Called in `GatherSampleEvidence` and `GatherBatchEvidence` - requires updating as the `END2` header line must be in the output VCF if it is set during standardization.
- `wdl/PESRClustering.wdl`: Called in `ClusterBatch` - requires updating to not use `--fix-end` during GATK formatting as these have already been fixed during standardization, and to not pass `remove_infos="END2"` during SVTK formatting as these are now the de-facto representation.
- `wdl/FilterBatchSites.wdl`: Called in `FilterBatchSites` - requires updating to call `annotate_bnd_coords.py`.
- `wdl/CombineBatches.wdl`: Called in `CombineBatches` - requires updating to not use `--fix-end` during GATK formatting as these have already been fixed during standardization, and to not pass `remove_infos="END2"` during SVTK formatting as these are now the de-facto representation.
- `inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl`: Used in `JoinRawCalls` - requires updating to not use `--fix-end` during GATK formatting.
- `inputs/templates/test/JoinRawCalls/JoinRawCalls.json.tmpl`: Used in `JoinRawCalls` - requires updating to not use `--fix-end` during GATK formatting.

Omitted (uses .stop but doesn't require changing):
- `src/sv-pipeline/scripts/single_sample/update_variant_representations.py`: Called in GATKSVPipelineSingleSample, but already handles presence of `END2` tag.
- `src/sv-pipeline/scripts/make_scramble_vcf.py`: Called in GatherSampleEvidence, but the algorithm does not call sites with BNDs.
- `src/svtk/svtk/standardize/std_wham.py`: Called in GatherSampleEvidence and GatherBatchEvidence, but the algorithm does not call sites with `SVTYPE=BND`.
- `src/svtk/svtk/standardize/std_scramble.py`: Called in GatherSampleEvidence and GatherBatchEvidence, but the algorithm does not call sites with `SVTYPE=BND`.
- `src/WGD/bin/convert_gcnv.py`: Called in GatherBatchEvidence, but BNDs are not in gCNV VCFs.
- `src/svtk/svtk/cli/rdtest2vcf.py`: Called in GatherBatchEvidence, but only applies to DEL/DUP events.
- `src/sv-pipeline/04_variant_resolution/scripts/merge_vcfs.py`: Called in MergeBatchSites, but already checks for equality on the `END`, `END2` and `CHR2` fields.
- `src/sv-pipeline/04_variant_resolution/scripts/process_posthoc_cpx_depth_regenotyping.py`: Called in GenotypeComplexVariants, but VCF is already in correct format by this point.
- `src/sv-pipeline/04_variant_resolution/scripts/clean_vcf_part1b_filter.py`: Called in CleanVcf, but VCF is already in correct format by this point.
- `src/sv-pipeline/04_variant_resolution/scripts/clean_vcf_part5_update_records.py`: Called in CleanVcf, but VCF is already in correct format by this point.
- `src/sv-pipeline/04_variant_resolution/scripts/resolve_cpx_cnv_redundancies.py`: Called in CleanVcf, but VCF is already in correct format by this point.
- `wdl/RefineComplexVariants.wdl`: Called in RefineComplexVariants, but VCF is already in correct format by this point.
- `wdl/CleanVcfChromosome.wdl`: Called in CleanVcf, but VCF is already in correct format by this point.
- `src/sv-pipeline/scripts/make_sl_table.py`: Called in FilterGenotypes, but VCF is already in correct format by this point.
- `src/sv-pipeline/05_annotation/scripts/compute_AFs.py`: Called in AnnotateVcf, but VCF is already in correct format by this point.
- `src/sv-pipeline/scripts/identify_duplicates.py`: Called in MainVcfQc, but VCF is already in correct format by this point.

Omitted (uses .stop but no longer used in GATK-SV):
- `src/svtk/svtk/standardize/std_delly.py`: GATK-SV no longer supports this caller.
- `src/svtk/svtk/standardize/std_lumpy.py`: GATK-SV no longer supports this caller.
- `src/svtk/svtk/standardize/std_melt.py`: GATK-SV no longer supports this caller.
- `src/svtk/svtk/standardize/std_smoove.py`: GATK-SV no longer supports this caller.
- `src/sv-pipeline/04_variant_resolution/scripts/aggregate_dn.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/eliminate_redundancies.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/final_filter.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/make_concordant_multiallelic_alts.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/merge_linked_depth_calls.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/merge_pesr_depth.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/overlap_pass.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/remove_added_pilot_variants.py`: GATK-SV no longer uses this script.
- `src/sv-pipeline/04_variant_resolution/scripts/scrape_stats.py`: GATK-SV no longer uses this script.
- `src/sv_utils/src/sv_utils/fix_vcf.py`: GATK-SV no longer uses this script.
- `src/sv_utils/src/sv_utils/genomics_io.py`: GATK-SV no longer uses this script.
- `src/sv_utils/src/sv_utils/interval_overlaps.py`: GATK-SV no longer uses this script.
- `src/svtest/svtest/cli/vcf.py`: GATK-SV no longer uses this script.
- `src/svtest/svtest/utils/VCFUtils.py`: GATK-SV no longer uses this script.
- `src/svtk/svtk/utils/rdtest.py`: GATK-SV no longer uses this script.

Omitted (uses .stop but part of a non-pipeline workflow):
- `src/sv-pipeline/scripts/format_pb_for_gatk.py`: Called in MakeGqRecalibratorTrainingSetFromPacBio.
- `src/sv-pipeline/scripts/preprocess_gatk_for_pacbio_eval.py`: Called in MakeGqRecalibratorTrainingSetFromPacBio.
- `src/sv-pipeline/scripts/refine_training_set.py`: Called in MakeGqRecalibratorTrainingSetFromPacBio.
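
Illustrative sketch of the BND convention from the Description

To make the `END`/`END2` convention from the Description concrete, here is a minimal sketch in plain Python. It is not code from this PR and does not use Pysam: `BndRecord`, `normalize_bnd` and `bnd_mate` are hypothetical names introduced only for illustration. It assumes a BND's `END` is rewritten to `pos+1` and the true mate position is stored in `END2` alongside `CHR2`, as the Description states.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class BndRecord:
    """Hypothetical stand-in for a VCF record (not a Pysam type)."""
    chrom: str
    pos: int  # 1-based POS, as written in the VCF
    info: Dict[str, object] = field(default_factory=dict)


def normalize_bnd(record: BndRecord, chr2: str, true_end: int) -> BndRecord:
    """Apply the assumed convention: END = POS + 1, END2 = true mate position."""
    if record.info.get("SVTYPE") != "BND":
        return record  # other SV types keep their END untouched
    record.info["CHR2"] = chr2
    record.info["END"] = record.pos + 1  # never smaller than the start position
    record.info["END2"] = true_end       # position on the second chromosome
    return record


def bnd_mate(record: BndRecord) -> Tuple[str, int]:
    """Read the second breakend, preferring END2 and falling back to END."""
    chr2 = record.info.get("CHR2", record.chrom)
    end2 = record.info.get("END2", record.info.get("END"))
    return chr2, end2


# A translocation whose mate lies upstream on another chromosome.
rec = normalize_bnd(BndRecord("chr1", 1000, {"SVTYPE": "BND"}), "chr5", 500)
mate = bnd_mate(rec)
```

A consumer written this way never relies on `END` to locate the second breakend, which is the behaviour the change log asks of the downstream scripts.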