Skip to content

converting bcf 1st phase data as in v44 #1938

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Jul 22, 2025

Fix for #1932
It was discussed earlier to keep the phasing data in VCF44 format internally as that easily resolves the issue mentioned (will necessitate some changes in bcftools as there are a few checks with v4.3 type phase values - in convert tests) with VCF checks.
Data needs a consistent binary representation, irrespective of VCF/BCF source, so the same conversion is needed for BCF data as well, from v4.x to v4.4 in terms of phasing values. This conversion is made on read, for bcf with version < v44. While writing, it is again converted that data is stored without the change made, if version < v44. This adds ~15% overhead it seems.
--removed the mention of a few changes that seemed required in bcftools, it worked fine without any such change with later checks--

@daviesrob daviesrob self-assigned this Jul 24, 2025
@vasudeva8 vasudeva8 marked this pull request as ready for review August 4, 2025 17:13
Copy link
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If updatephasing() is changed as suggested, all of HTSlib's tests still pass.

Most of bcftools' do as well, apart from bcftools convert --haplegendsample which starts incorrectly marking lots of genotypes as partially phased. This is because the process_gt_to_hap() and process_gt_to_hap2() functions currently assume the first phasing bit is always zero. As they're likely broken for VCF4.4, they'll need to be fixed irrespective of this change. Adjusting them to ignore the first phasing bit gets them working again, and after that all bcftools tests pass.

vcf.c Outdated
Comment on lines 1866 to 1879
/*set phased on read or make unphased on write as upto 4.3 1st
phasing is not described explicitly and has to be inferred*/
switch(type) {
case BCF_BT_INT8:
*(uint8_t*)ptr1 += upd;
break;
case BCF_BT_INT16:
i16_to_le(le_to_i16(ptr1) + upd, ptr1);
break;
case BCF_BT_INT32:
i32_to_le(le_to_i32(ptr1) + upd, ptr1);
break;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be doing |= 1 instead of + upd. Adding or subtracting values risks accidentally changing the GT if the function is called with set == 1 and the phasing bit already set; or changing it to missing if called with set == 0 and the existing value was 2.

I also don't think it's necessary to revert the bit, so the set argument could be removed. While the written bcfs may differ slightly, it doesn't appear to matter, and any code relying on the leading bit being zero will likely break on VCF4.4+ so needs to be fixed anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is better to change to |/& rather than using + to avoid the troubles mentioned.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

vcf.c Outdated
* If the version in header is >= 4.4, no change is made. Otherwise 1st phasing
* is set if there are no other unphased ones.
*/
HTSLIB_EXPORT int update44phasing(bcf_hdr_t *h, bcf1_t *b, int setreset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we only allow the phasing to be set, we can drop setreset here too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

vcf.c Outdated
Comment on lines 2445 to 2449
if (update44phasing(h, v, 0)) { //reset phasing update made after read
hts_log_error("Failed to set prorper phasing at %s:%"PRIhts_pos"", bcf_seqname_safe(h,v), v->pos+1);
return -1;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@vasudeva8
Copy link
Contributor Author

vasudeva8 commented Aug 7, 2025

The reset made on bcf write is removed.
This will make the output different from input. BCF is not widely in use due to its inherent compatibility issues and hence this difference can be ignored.

@vasudeva8 vasudeva8 force-pushed the phase44update1 branch 2 times, most recently from fce7ac9 to 8af52e2 Compare August 8, 2025 11:37
@@ -708,6 +708,11 @@ static int _reader_fill_buffer(bcf_srs_t *files, bcf_sr_t *reader)
ret = bcf_itr_next(reader->file, reader->itr, reader->buffer[reader->nbuffer+1]);
if ( ret < -1 ) files->errnum = bcf_read_error;
if ( ret < 0 ) break; // no more lines or an error
//set phasing of 1st value as in vcf v44
if ((ret = update44phasing(reader->header, reader->buffer[reader->nbuffer+1]))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious how this affects speed?

Copy link
Contributor Author

@vasudeva8 vasudeva8 Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially it was observed to make ~15% overhead. On recheck, last update had more overhead and tweaked further. It shows ~13% overhead with bcftools view with the latest changes
.

@vasudeva8 vasudeva8 force-pushed the phase44update1 branch 2 times, most recently from e94005e to ad77218 Compare August 15, 2025 14:25
@vasudeva8
Copy link
Contributor Author

vasudeva8 commented Aug 15, 2025

The BCF conversion is removed, to avoid the performance overhead, based on discussions.
1st phasing value in VCF, with version upto v4.3, are set based on rest of the values in the sample.
The API bcf_get_format_values modified to do the same conversion when the retrieved field is GT, limiting any overhead to such cases where GT is retrieved.

With this, BCF and VCF will have different internal binary values/representations.
When the bcf_get_format_values api is used, phasing data as per v4.4 is retrieved.
When the values are accessed directly, w/o the get_format api, user has to determine the phasing for 1st value based on phasing of other alleles.

As conversion is made for vcf, merge issue mentioned is fixed for vcf/vcf.gz. When input is bcf, the issue still exists and conversion may still be required. Performance issue will be looked further that it can be used for BCF as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants