-
Notifications
You must be signed in to change notification settings - Fork 456
converting bcf 1st phase data as in v44 #1938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
455a96e
to
aebd6ee
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If updatephasing()
is changed as suggested, all of HTSlib's tests still pass.
Most of bcftools' do as well, apart from bcftools convert --haplegendsample
which starts incorrectly marking lots of genotypes as partially phased. This is because the process_gt_to_hap()
and process_gt_to_hap2()
functions currently assume the first phasing bit is always zero. As they're likely broken for VCF4.4, they'll need to be fixed irrespective of this change. Adjusting them to ignore the first phasing bit gets them working again, and after that all bcftools tests pass.
vcf.c
Outdated
/*set phased on read or make unphased on write as upto 4.3 1st | ||
phasing is not described explicitly and has to be inferred*/ | ||
switch(type) { | ||
case BCF_BT_INT8: | ||
*(uint8_t*)ptr1 += upd; | ||
break; | ||
case BCF_BT_INT16: | ||
i16_to_le(le_to_i16(ptr1) + upd, ptr1); | ||
break; | ||
case BCF_BT_INT32: | ||
i32_to_le(le_to_i32(ptr1) + upd, ptr1); | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these should be doing |= 1
instead of + upd
. Adding or subtracting values risks accidentally changing the GT if the function is called with set == 1
and the phasing bit already set; or changing it to missing if called with set == 0
and the existing value was 2.
I also don't think it's necessary to revert the bit, so the set
argument could be removed. While the written bcfs may differ slightly, it doesn't appear to matter, and any code relying on the leading bit being zero will likely break on VCF4.4+ so needs to be fixed anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is better to change to |/& rather than using + to avoid the troubles mentioned.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
vcf.c
Outdated
* If the version in header is >= 4.4, no change is made. Otherwise 1st phasing | ||
* is set if there are no other unphased ones. | ||
*/ | ||
HTSLIB_EXPORT int update44phasing(bcf_hdr_t *h, bcf1_t *b, int setreset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we only allow the phasing to be set, we can drop setreset
here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
vcf.c
Outdated
if (update44phasing(h, v, 0)) { //reset phasing update made after read | ||
hts_log_error("Failed to set prorper phasing at %s:%"PRIhts_pos"", bcf_seqname_safe(h,v), v->pos+1); | ||
return -1; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
f3a5f06
to
07c0c5e
Compare
The reset made on bcf write is removed. |
fce7ac9
to
8af52e2
Compare
synced_bcf_reader.c
Outdated
@@ -708,6 +708,11 @@ static int _reader_fill_buffer(bcf_srs_t *files, bcf_sr_t *reader) | |||
ret = bcf_itr_next(reader->file, reader->itr, reader->buffer[reader->nbuffer+1]); | |||
if ( ret < -1 ) files->errnum = bcf_read_error; | |||
if ( ret < 0 ) break; // no more lines or an error | |||
//set phasing of 1st value as in vcf v44 | |||
if ((ret = update44phasing(reader->header, reader->buffer[reader->nbuffer+1]))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious how this affects speed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially it was observed to make ~15% overhead. On recheck, last update had more overhead and tweaked further. It shows ~13% overhead with bcftools view with the latest changes
.
e94005e
to
ad77218
Compare
The BCF conversion is removed, to avoid the performance overhead, based on discussions. With this, BCF and VCF will have different internal binary values/representations. As conversion is made for vcf, merge issue mentioned is fixed for vcf/vcf.gz. When input is bcf, the issue still exists and conversion may still be required. Performance issue will be looked further that it can be used for BCF as well. |
ad77218
to
0335a05
Compare
Fix for #1932
It was discussed earlier to keep the phasing data in VCF44 format internally as that easily resolves the issue mentioned (will necessitate some changes in bcftools as there are a few checks with v4.3 type phase values - in convert tests) with VCF checks.
Data needs a consistent binary representation, irrespective of VCF/BCF source, so the same conversion is needed for BCF data as well, from v4.x to v4.4 in terms of phasing values. This conversion is made on read, for bcf with version < v44. While writing, it is again converted that data is stored without the change made, if version < v44. This adds ~15% overhead it seems.
--removed the mention of a few changes that seemed required in bcftools, it worked fine without any such change with later checks--