Speed up of VCF parsing / writing with FORMAT fields. #1081
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
If we're reading and writing VCF without intervening change to BCF or
use of bcf_unpack with BCF_UN_FMT (so no querying and/or modification
of the per-sample FORMAT columns), then we keep the data in string
form instead of binary bcf form.
The bcf binary form is held in kstring b->indiv. We also hold the
string format there too, but use b->unpacked & VCF_UN_FMT as a
bit-flag to indicate whether this is encoding as VCF or BCF.
The benefit is reading is considerably faster for many-sample files
(~4 fold for a 2.5k sample 1000 genomes test) and similarly (~3 fold)
for a read/write loop to VCF as we don't have to re-encode the data
from uBCF to VCF either.
If we're converting to BCF, or doing a query, we still need the call
to vcf_parse_format as before, which reformats b->indiv from VCF to
BCF internally. This is done on the fly.
There is some nastiness to how this works sadly as bcf_unpack just
takes the bcf record and has no access to the header, but this is
needed for parsing. Therefore we stash a copy of it in the kstring
along with the text itself, but this assumes that no one is doing
header oddities such as reading the file, freeing the header, and
then attempting to unpack the contents. It feels like a safe
assumption, although it's not something that has ever been explicitly
spelt out before.
Note this causes a few bcftools test failures due to fixups that
happen when doing a full conversion to BCF. For example
However this now causes a test failure. Eg test/noroundtrip.vcf has
FORMAT of "GT:GT 0/1" but if we force into BCF and back out again we
get "GT:GT 2,4:.".