ingest: How to handle segmented viruses

Related to https://github.com/nextstrain/private/issues/102 and https://github.com/nextstrain/pathogen-repo-guide/issues/50

This issue documents our current methods for ingesting segmented viruses. These differ slightly from gene-specific builds like RSV or measles because the upstream records in NCBI GenBank are per segment rather than per genome.

### Avian flu example

The `segment` and `strain` fields are fairly well standardized for the recent H5N1 outbreak, so we can use them directly to match segments that belong to the same metadata record.

1. [Pull segment/strain name from NCBI Virus](https://github.com/nextstrain/avian-flu/blob/74b95ff842c0931cd85dbb90d21344f2190aa55a/ingest/build-configs/ncbi/bin/ncbi-virus-url#L65-L66). 
2. All data is processed through the usual curation pipeline.
3. Then [split into segment metadata + sequences](https://github.com/nextstrain/avian-flu/blob/74b95ff842c0931cd85dbb90d21344f2190aa55a/ingest/build-configs/ncbi/rules/curate.smk#L108).
4. Finally, loop through the segment metadata [to add an `n_segments` column](https://github.com/nextstrain/avian-flu/blob/74b95ff842c0931cd85dbb90d21344f2190aa55a/ingest/rules/merge_segment_metadata.smk#L7) tracking how many segments are linked to the metadata record. There is no "merging" of segment metadata; we just use the metadata for the HA segment.

This results in 1 metadata TSV + 8 FASTAs where the segment sequences are linked to the metadata via a unique `strain`. 
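
As a rough illustration of step 4, here is a minimal Python sketch of the `n_segments` bookkeeping. It is not the code in `merge_segment_metadata.smk`. The function name `add_n_segments` and the in-memory row format (a dict of segment name to a list of metadata rows) are hypothetical. It also assumes that strains without an HA record are dropped, which follows from using only the HA metadata but may not match the real workflow exactly.

```python
from collections import defaultdict


def add_n_segments(segment_rows, metadata_segment="ha"):
    """Return the metadata rows of one segment with an added n_segments column.

    segment_rows: dict mapping a segment name (e.g. "ha") to a list of
    metadata rows (dicts) that each carry a "strain" key.
    """
    # Collect the set of segments observed for every strain name.
    segments_per_strain = defaultdict(set)
    for segment, rows in segment_rows.items():
        for row in rows:
            segments_per_strain[row["strain"]].add(segment)

    # No merging of per-segment metadata: keep only the rows of the chosen
    # segment (HA by default) and annotate each with its segment count.
    annotated = []
    for row in segment_rows.get(metadata_segment, []):
        new_row = dict(row)
        new_row["n_segments"] = len(segments_per_strain[row["strain"]])
        annotated.append(new_row)
    return annotated
```

Because the link between metadata and the 8 FASTAs is the `strain` value, this approach only works when strain names are identical across the segment records of a sample.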
 
### Lassa example 

I had originally thought we could replicate the avian flu ingest in lassa, but the [lack of standardized segment](https://github.com/nextstrain/lassa/pull/12#issuecomment-2251439504) and strain fields makes this difficult. @j23414 has implemented an alternative method in https://github.com/nextstrain/lassa/pull/12.

1. Workflow uses the usual `datasets download` and curation pipeline. 
2. Use `nextclade run` to align records to L and S reference sequences to separate the segment sequences. 
3. Use `augur filter` to subset the metadata by segment sequences, using `accession` as the metadata ID column.
4. The original metadata + sequences files that contain _all_ records are kept as a record of samples that failed to align to either L or S.

This results in 3 metadata TSVs + 3 FASTAs (all records, L, and S). The metadata/sequences with _all_ records are only used for debugging purposes. The phylogenetic workflow would start from the L/S metadata.tsv + sequences.fasta, where the records are linked by a unique `accession`.
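
To make steps 3 and 4 concrete, here is a small Python sketch of the subsetting logic. It only mimics the effect of `augur filter` keyed on `accession`; it does not call Augur or Nextclade. The helper names (`fasta_ids`, `subset_metadata`, `unassigned_accessions`) are hypothetical, and it assumes each per-segment FASTA contains only the records that aligned to that segment's reference.

```python
def fasta_ids(fasta_text):
    """Return the set of record IDs (first token of each '>' header line)."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def subset_metadata(metadata_rows, ids, id_column="accession"):
    """Keep only the metadata rows whose ID appears in the given ID set."""
    return [row for row in metadata_rows if row[id_column] in ids]


def unassigned_accessions(metadata_rows, *segment_id_sets, id_column="accession"):
    """List IDs found in no segment set, i.e. records that failed to align."""
    assigned = set().union(*segment_id_sets)
    return [row[id_column] for row in metadata_rows if row[id_column] not in assigned]
```

Given the full metadata plus the L and S FASTAs, `subset_metadata` would yield the per-segment metadata, and `unassigned_accessions` would list the records that only exist in the "all records" files kept for debugging.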

I [toyed with the idea of creating strain names](https://github.com/nextstrain/lassa/pull/12#issuecomment-2251520598) for lassa, but the lack of data makes it difficult to follow our usual pattern of `<location>/<sample_id>/<year>`. The lack of linked BioSample records also prevents us from using the BioSample accession to link segments. 

### Other viruses?

I think we got lucky with the H5N1 data. Other segmented viruses are likely to have less standardized data, as lassa does. The default method for ingesting segmented virus data from NCBI should probably follow the lassa example.
