ingest: How to handle segmented viruses #59

@joverlee521

Related to https://github.com/nextstrain/private/issues/102 and #50

Writing out current methods for ingesting segmented viruses. These are slightly different from gene-specific builds like RSV or measles because the upstream records in NCBI GenBank are per segment rather than per genome.

Avian flu example

The segment and strain fields are pretty standardized for the recent H5N1 outbreak, so we can use those fields directly to match segments that belong to the same metadata record.

  1. Pull segment/strain name from NCBI Virus.
  2. All data is processed through the usual curation pipeline.
  3. Then split the curated data into per-segment metadata + sequences.
  4. Finally, loop through the segment metadata to add an n_segments column tracking how many segments are linked to each metadata record. There is no "merging" of segment metadata; we just use the metadata for the HA segment.

This results in 1 metadata TSV + 8 FASTAs where the segment sequences are linked to the metadata via a unique strain.
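
A minimal sketch of the n_segments step, not the code from the avian flu workflow. It assumes the per-segment metadata has already been loaded as lists of row dicts that share a `strain` column. The function name `add_n_segments` and the lowercase segment keys are hypothetical.

```python
from collections import Counter


def add_n_segments(segment_metadata, base_segment="ha"):
    """Return the base segment's metadata rows with an added n_segments column.

    segment_metadata: dict mapping a segment name to a list of row dicts,
    each with a "strain" key (assumed layout, for illustration only).
    """
    counts = Counter()
    for rows in segment_metadata.values():
        # Count each strain at most once per segment, even if it appears twice.
        for strain in {row["strain"] for row in rows}:
            counts[strain] += 1

    # No merging of segment metadata: only the base segment's rows are kept.
    return [
        {**row, "n_segments": counts[row["strain"]]}
        for row in segment_metadata[base_segment]
    ]
```

Strains that are missing from the HA metadata are dropped by this sketch, since the HA rows are the only ones carried forward.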

Lassa example

I had originally thought we could replicate the avian flu ingest in lassa, but the lack of standardized segment and strain fields makes this difficult. @j23414 has implemented an alternative method in nextstrain/lassa#12.

  1. Workflow uses the usual datasets download and curation pipeline.
  2. Use nextclade run to align records to L and S reference sequences to separate the segment sequences.
  3. Use augur filter to subset the metadata by segment sequences, using accession as the metadata id column.
  4. The original metadata + sequences files, which contain all records, are kept as a record of samples that failed to align to either L or S.

This results in 3 metadata TSVs + 3 FASTAs. The metadata/sequences with all records are only used for debugging purposes. The phylogenetic workflow would start from the L/S metadata.tsv + sequences.fasta, where the records are linked by a unique accession.
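
The subsetting logic can be sketched in plain Python as a stand-in for the augur filter step. This is an illustration, not the code in nextstrain/lassa#12. It assumes each per-segment FASTA header starts with the accession and that metadata rows are dicts with an `accession` column. The helper names `fasta_ids`, `subset_metadata`, and `unassigned` are hypothetical.

```python
def fasta_ids(fasta_text):
    """Collect sequence ids (first word of each header line) from FASTA text."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def subset_metadata(rows, ids, id_column="accession"):
    """Keep only the metadata rows whose id appears in the segment's sequences."""
    return [row for row in rows if row[id_column] in ids]


def unassigned(rows, segment_ids, id_column="accession"):
    """Rows that were not assigned to any segment, kept for debugging."""
    assigned = set().union(*segment_ids.values())
    return [row for row in rows if row[id_column] not in assigned]
```

Called once per segment, `subset_metadata` yields the L and S metadata, while `unassigned` lists the records that only appear in the all-records files.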

I toyed with the idea of creating strain names for lassa, but the lack of data makes it difficult to follow our usual pattern of <location>/<sample_id>/<year>. The lack of linked BioSample records also prevents us from using the BioSample accession to link segments.

Other viruses?

I think we got lucky with the H5N1 data. Other segmented viruses are very likely to have less standardized data, as lassa does. The default method for ingesting segmented virus data from NCBI should probably follow the lassa example.
