Skip to content

Conversation

kjaisingh
Copy link
Collaborator

@kjaisingh kjaisingh commented Jul 1, 2025

Description

This PR includes code to refactor VCFs passed through GATK-SV to align it to Pysam's upgraded versions, which do not allow variants to have an END position that is less than its start position. We hence ensure that all VCFs passed through GATK-SV represent the END field for all BND variants to pos+1, and set the END2 field to be the true end position on the second chromosome. All downstream scripts that consume such VCFs are made to interpret BNDs with appropriate logic based on this updated END2 tag.

As it stands, VCFs only use the correct END format starting from the outputs of 13-ResolveComplexVariants. However, it sets the END field for such variants to pos, though this is no longer supported in later versions of Pysam in which case this tag is simply dropped - hence why we use pos+1 in this PR.

Testing

  • The following Terra workspace contains an end-to-end run of GATK-SV with the original workflows.
  • The following Terra workspace contains an end-to-end run of GATK-SV with the updated changes included.
  • The [following slide deck)(https://docs.google.com/presentation/d/103aBnn2omKI-CjjNckCFfg94iMPQ2WS7v_P0wc44ew4/edit?usp=sharing) contains some figures that compare the results across each of the above.
  • Workflow-specific differences observed across file outputs:
    • GatherSampleEvidence:
      • wham_vcf has some minor differences, as Wham is a stochastic algorithm.
      • sample_metrics_files now has the manta_<SAMPLE>_vcf_invalid_end metric reduce to 0.
    • GatherBatchEvidence:
      • std_manta_vcf_tar now includes well-formatted END/END2 tags.
      • manta_tloc now has a different ordering of INFO fields.
      • Wham-related outputs (i.e. std_wham_vcf_tar) have minor differences, as expected.
    • ClusterBatch:
      • clustered_manta_vcf now includes well-formatted END/END2 tags.
      • Wham-related outputs (i.e. clustered_wham_vcf, Wham's clustered_sv_counts, Wham metrics in metrics_file_clusterbatch) have minor differences, as expected.
    • GenerateBatchMetrics:
      • metrics and metrics_common differ slightly for Wham records, given the stochasticity of Wham.
      • metrics_file_batchmetrics differs slightly due to the presence of variable Wham records.
    • filtered_vcf in FilteredGenotypes:
      • The CHR2 tag no longer exists for variants with SVTYPE=CPX.

Pre-Merge Changes Required

  • Based on the Omitted (use _.stop_ but no longer used in GATK-SV) section, archive all scripts that are no longer used in GATK-SV.
  • Update JoinRawCalls workflow in featured workspace to not use --fix-end.
  • Remove the temporary changes to src/svtk/svtk/utils/__init__.py and the automated syncing of WDLs to Dockstore.

Change Log

Included (uses .stop):

  • src/svtk/svtk/standardize/standardize.py: Called in GatherSampleEvidence and GatherBatchEvidence - requires updating as the call to in TinyResolve leverages Pysam on the standardized VCFs.
  • src/svtk/svtk/standardize/std_manta.py: Called in GatherSampleEvidence and GatherBatchEvidence - requires updating as the call to svtk standardize in TinyResolve leverages Pysam on the standardized VCFs.
  • src/svtk/svtk/standardize/std_dragen.py: Called in GatherBatchEvidence - requires updating as the call to svtk standardize in StandardizeVCFs leverages Pysam to build standardized VCFs.
  • src/svtk/svtk/cli/resolve.py: Called in GatherBatchEvidence by TinyResolve.
  • src/sv-pipeline/scripts/format_svtk_vcf_for_gatk.py: Called in ClusterBatch, CombineBatches, CleanVcf and JoinRawCalls - requires updating to set the END tag to record.pos instead of record.start+1.
  • src/sv-pipeline/scripts/format_gatk_vcf_for_svtk.py: Called in ClusterBatch and CombineBatches - requires updating to set the END and END2 tags appropriately.
  • src/svtk/svtk/cli/pesr_test.py: Called in GenerateBatchMetrics
  • src/svtk/svtk/pesr/pe_test.py: Called in GenerateBatchMetrics.
  • src/svtk/svtk/pesr/sr_test.py: Called in GenerateBatchMetrics.
  • src/svtk/svtk/pesr/breakpoint.py: Called in GenerateBatchMetrics.
  • src/svtk/svtk/utils/utils.py: Called via vcf2bed in GenerateBatchMetrics, CleanVcf, ResolveComplexVariants and more.
  • src/sv-pipeline/02_evidence_assessment/02e_metric_aggregation/scripts/aggregate.py: Called in GenerateBatchMetrics.
  • src/sv-pipeline/scripts/annotate_bnd_coords.py: Called in FilterBatchSites - new script that adds the END2 field to variants reclassified by random forest model to be a BND from other variant types.
  • src/sv-pipeline/03_variant_filtering/scripts/rewrite_SR_coords.py: Called in FilterBatchSites.
  • src/sv-pipeline/04_variant_resolution/scripts/postCPX_cleanup.py: Called in ResolveComplexVariants - requires updating to set the END tag to record.pos instead of record.start+1, and to not overwrite the already-corrected END2 tag.
  • src/sv-pipeline/04_variant_resolution/scripts/overlap_breakpoint_filter.py: Called in ResolveComplexVariants.
  • src/svtk/svtk/svfile.py: Called in ResolveComplexVariants.
  • src/svtk/svtk/cxsv/complex_sv.py: Called in ResolveComplexVariants.
  • src/svtk/svtk/cxsv/cpx_inv.py: Called in ResolveComplexVariants.
  • src/svtk/svtk/cxsv/cpx_link.py: Called in ResolveComplexVariants.
  • src/svtk/svtk/cxsv/cpx_tloc.py: Called in ResolveComplexVariants.
  • src/svtk/svtk/cxsv/rescan_single_enders.py: Called in ResolveComplexVariants.

Included (doesn't use .stop):

  • src/svtk/svtk/cli/standardize_vcf.py: Called in GatherSampleEvidence and GatherBatchEvidence - requires updating as the END2 header line must be in in the output VCF if it is set during standardization.
  • wdl/PESRClustering.wdl: Called in ClusterBatch - requires updating to not use --fix-end during GATK formatting as these have already been fixed during standardization, and to not pass remove_infos="END2" during SVTK formatting as these are now the de-facto representation.
  • wdl/FilterBatchSites.wdl: Called in FilterBatchSites - requires updating to call annotate_bnd_coords.py
  • wdl/CombineBatches.wdl: Called in CombineBatches - requires updating to not use --fix-end during GATK formatting as these have already been fixed during standardization, and to not pass remove_infos="END2" during SVTK formatting as these are now the de-facto representation.
  • inputs/templates/terra_workspaces/cohort_mode/workflow_configurations/JoinRawCalls.json.tmpl: Used in JoinRawCalls - requires updating to not use --fix-end during GATK formatting.
  • inputs/templates/test/JoinRawCalls/JoinRawCalls.json.tmpl: Used in JoinRawCalls - requires updating to not use --fix-end during GATK formatting.

Omitted (uses .stop but don't require changing):

  • src/sv-pipeline/scripts/single_sample/update_variant_representations.py: Called in GATKSVPipelineSingleSample, but already handles presence of END2 tag.
  • src/sv-pipeline/scripts/make_scramble_vcf.py: Called in GatherSampleEvidence, but the algorithm does not call sites with BNDs.
  • src/svtk/svtk/standardize/std_wham.py: Called in GatherSampleEvidence and GatherBatchEvidence, but the algorithm does not call sites with SVTYPE=BND.
  • src/svtk/svtk/standardize/std_scramble.py Called in GatherSampleEvidence and GatherBatchEvidence, but the algorithm does not call sites with SVTYPE=BND.
  • src/WGD/bin/convert_gcnv.py: Called in GatherBatchEvidence, but BNDs are not in gCNV VCFs.
  • src/svtk/svtk/cli/rdtest2vcf.py: Called in GatherBatchEvidence, but only applies to DEL/DUP events.
  • src/sv-pipeline/04_variant_resolution/scripts/merge_vcfs.py: Called in MergeBatchSites, but already checks for equality on the END, END2 and CHR2 fields.
  • src/sv-pipeline/04_variant_resolution/scripts/process_posthoc_cpx_depth_regenotyping.py: Called in GenotypeComplexVariants, but VCF is already in correct format by this point.
  • src/sv-pipeline/04_variant_resolution/scripts/clean_vcf_part1b_filter.py: Called in CleanVcf, but VCF is already in correct format by this point.
  • src/sv-pipeline/04_variant_resolution/scripts/clean_vcf_part5_update_records.py: Called in CleanVcf, but VCF is already in correct format by this point.
  • src/sv-pipeline/04_variant_resolution/scripts/resolve_cpx_cnv_redundancies.py: Called in CleanVcf, but VCF is already in correct format by this point.
  • wdl/RefineComplexVariants.wdl: Called in RefineComplexVariants, but VCF is already in correct format by this point.
  • wdl/CleanVcfChromosome.wdl: Called in CleanVcf, but VCF is already in correct format by this point.
  • src/sv-pipeline/scripts/make_sl_table.py: Called in FilterGenotypes, but VCF is already in correct format by this point.
  • src/sv-pipeline/05_annotation/scripts/compute_AFs.py: Called in AnnotateVcf, but VCF is already in correct format by this point.
  • src/sv-pipeline/scripts/identify_duplicates.py: Called in MainVcfQc, but VCF is already in correct format by this point.

Omitted (uses .stop but no longer used in GATK-SV):

  • src/svtk/svtk/standardize/std_delly.py: GATK-SV no longer supports this caller.
  • src/svtk/svtk/standardize/std_lumpy.py: GATK-SV no longer supports this caller.
  • src/svtk/svtk/standardize/std_melt.py: GATK-SV no longer supports this caller.
  • src/svtk/svtk/standardize/std_smoove.py: GATK-SV no longer supports this caller.
  • src/sv-pipeline/04_variant_resolution/scripts/aggregate_dn.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/eliminate_redundancies.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/final_filter.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/make_concordant_multiallelic_alts.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/merge_linked_depth_calls.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/merge_pesr_depth.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/overlap_pass.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/remove_added_pilot_variants.py: GATK-SV no longer uses this script.
  • src/sv-pipeline/04_variant_resolution/scripts/scrape_stats.py: : GATK-SV no longer uses this script.
  • src/sv_utils/src/sv_utils/fix_vcf.py: GATK-SV no longer uses this script.
  • src/sv_utils/src/sv_utils/genomics_io.py: GATK-SV no longer uses this script.
  • src/sv_utils/src/sv_utils/interval_overlaps.py: GATK-SV no longer uses this script.
  • src/svtest/svtest/cli/vcf.py: GATK-SV no longer uses this script.
  • src/svtest/svtest/utils/VCFUtils.py: GATK-SV no longer uses this script.
  • src/svtk/svtk/utils/rdtest.py: GATK-SV no longer uses this script.

Omitted (uses .stop but part of a non-pipeline workflow):

  • src/sv-pipeline/scripts/format_pb_for_gatk.py: Called in MakeGqRecalibratorTrainingSetFromPacBio.
  • src/sv-pipeline/scripts/preprocess_gatk_for_pacbio_eval.py: Called in MakeGqRecalibratorTrainingSetFromPacBio.
  • src/sv-pipeline/scripts/refine_training_set.py: Called in MakeGqRecalibratorTrainingSetFromPacBio.

@kjaisingh kjaisingh self-assigned this Jul 1, 2025
@kjaisingh kjaisingh added the dependencies Pull requests that update a dependency file label Jul 1, 2025
@kjaisingh kjaisingh changed the title [Do Not Merge] Updates all BND-agnostic references of Pysam's .stop across GATK-SV [Do Not Merge] Updates all BND-agnostic variant references via Pysam across GATK-SV Jul 1, 2025
@kjaisingh kjaisingh changed the title [Do Not Merge] Updates all BND-agnostic variant references via Pysam across GATK-SV Update all BND-agnostic variant references via Pysam Jul 2, 2025
@kjaisingh kjaisingh changed the title Update all BND-agnostic variant references via Pysam Update all BND-agnostic references to variant end positions via Pysam Jul 2, 2025
@kjaisingh kjaisingh changed the title Update all BND-agnostic references to variant end positions via Pysam Update representation of BNDs in GATK-SV Jul 28, 2025
@kjaisingh kjaisingh changed the title Update representation of BNDs in GATK-SV Update representation of BNDs to be Pysam compatible Jul 31, 2025
@kjaisingh kjaisingh marked this pull request as ready for review August 5, 2025 17:33
@kjaisingh kjaisingh requested a review from mwalker174 August 6, 2025 17:30
@kjaisingh kjaisingh marked this pull request as draft August 14, 2025 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Pull requests that update a dependency file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant