Commit 4ae1d19

Update terminology to reflect ARG paper
1 parent 15ed49b commit 4ae1d19

3 files changed

+47
-24
lines changed

args.md

Lines changed: 24 additions & 11 deletions
@@ -24,9 +24,9 @@ parent to child nodes. Therefore a succinct tree sequence is equivalent to a
 [directed graph](https://en.wikipedia.org/wiki/Directed_graph),
 which is additionally annotated with genomic positions such that at each
 position, a path through the edges exists which defines a tree. This graph
-interpretation of a tree sequence is tightly connected to the concept of
+interpretation of a tree sequence maps very closely to the concept of
 an "ancestral recombination graph" (or ARG). See
-[this preprint](https://www.biorxiv.org/content/10.1101/2023.11.03.565466v1) for further details.
+[this preprint](https://www.biorxiv.org/content/10.1101/2023.11.03.565466v2) for further details.
 
 ## Full ARGs
 

@@ -39,12 +39,16 @@ graph structure defined by that process, see e.g.
 
 The term "ARG" is [often used](https://doi.org/10.1086%2F508901) to refer to
 a structure consisting of nodes and edges that describe the genetic genealogy of a set
-of sampled chromosomes which have evolved via a process of genetic inheritance combined
-with recombination. ARGs may contain not just nodes corresponding to genetic
-coalescence, but also additional nodes that correspond e.g. to recombination events.
-These "full ARGs" can be stored and analysed in
+of sampled chromosomes which have evolved via a process of inheritance combined
+with recombination. We use the term "full ARG" to describe a commonly-described type of
+ARG that contains not just nodes that correspond to
+coalescence of ancestral material, but also additional nodes that correspond to
+recombination events and common ancestor events that are not associated with
+coalescence in any of the local trees. Full ARGs can be stored and analysed in
 [tskit](https://tskit.dev) like any other tree sequence. A full ARG can be generated using
-{func}`msprime:msprime.sim_ancestry` with the `record_full_arg=True` option, as described
+{func}`msprime:msprime.sim_ancestry` by specifying `coalescing_segments_only=False` along with
+`additional_nodes = msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT`
+(or the equivalent `record_full_arg=True`) as described
 {ref}`in the msprime docs<msprime:sec_ancestry_full_arg>`:
 
 ```{code-cell}
@@ -58,8 +62,12 @@ parameters = {
     "random_seed": 333,
 }
 
-ts_arg = msprime.sim_ancestry(**parameters, record_full_arg=True, discrete_genome=False)
-# NB: the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome)
+ts_arg = msprime.sim_ancestry(
+    **parameters,
+    discrete_genome=False, # the strict Hudson ARG needs unique crossover positions (i.e. a continuous genome)
+    coalescing_segments_only=False, # setting record_full_arg=True is equivalent to these last 2 parameters
+    additional_nodes=msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT,
+)
 
 print('Simulated a "full ARG" under the Hudson model:')
 print(
@@ -282,7 +290,12 @@ its simplified version:
 ```{code-cell}
 large_sim_parameters = parameters.copy()
 large_sim_parameters["sequence_length"] *= 1000
-large_ts_arg = msprime.sim_ancestry(**large_sim_parameters, record_full_arg=True)
+large_ts_arg = msprime.sim_ancestry(
+    **large_sim_parameters,
+    discrete_genome=False,
+    coalescing_segments_only=False,
+    additional_nodes=msprime.NodeType.COMMON_ANCESTOR | msprime.NodeType.RECOMBINANT,
+)
 large_ts = large_ts_arg.simplify()
 
 print(
@@ -312,7 +325,7 @@ difference between some classical ARG formulations, and the ARG formulation
 used in `tskit`. Classically, nodes in an ARG are taken to represent _events_
 (specifically, "common ancestor", "recombination", and "sampling" events),
 and genomic regions of inheritance are encoded by storing a specific breakpoint location on
-each recombination node. In contrast, [nodes](tskit:sec_data_model_definitions_node) in a `tskit`
+each recombination node. In contrast, {ref}`nodes<tskit:sec_data_model_definitions_node>` in a `tskit`
 ARG correspond to _genomes_. More crucially, inherited regions are defined by intervals
 stored on *edges* (via the {attr}`~Edge.left` and {attr}`~Edge.right` properties),
 rather than on nodes. Here, for example, is the edge table from our ARG:
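
The hunk ends before the edge table itself. As a rough illustration of what "intervals stored on edges" means, here is a minimal plain-Python sketch. It does not use tskit or msprime: the `Edge` tuple, the `TOY_EDGES` table and both helper functions are hypothetical names invented for this example, and the toy graph encodes a recombinant as a single node with two parents, which is a simplifying assumption rather than a description of msprime's output.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    """One row of a toy edge table (illustrative only, not a tskit class)."""
    left: float    # inclusive start of the inherited interval
    right: float   # exclusive end of the inherited interval
    parent: int
    child: int


# Hypothetical ARG on the genome [0, 10); nodes 0 and 1 are the samples.
# Node 2 is a recombinant: it inherits [0, 5) from node 3 and [5, 10) from node 4.
TOY_EDGES = [
    Edge(0, 10, 2, 0),
    Edge(0, 5, 3, 2),
    Edge(5, 10, 4, 2),
    Edge(0, 5, 5, 3),
    Edge(5, 10, 5, 4),
    Edge(0, 10, 6, 5),
    Edge(0, 10, 6, 1),
]


def local_tree_parents(edges, position):
    """Return {child: parent} for the local tree covering `position`."""
    return {e.child: e.parent for e in edges if e.left <= position < e.right}


def path_to_root(parents, node):
    """Follow parent pointers upwards from `node` until a root is reached."""
    path = [node]
    while path[-1] in parents:
        path.append(parents[path[-1]])
    return path


left_tree = local_tree_parents(TOY_EDGES, 2.5)
right_tree = local_tree_parents(TOY_EDGES, 7.5)
print(path_to_root(left_tree, 0))   # [0, 2, 3, 5, 6]
print(path_to_root(right_tree, 0))  # [0, 2, 4, 5, 6]
```

No breakpoint is stored on node 2. The two local trees differ only because the edges above node 2 cover different intervals, which is the point the changed paragraph makes.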

terminology_and_concepts.md

Lines changed: 21 additions & 11 deletions
@@ -416,28 +416,38 @@ there are multiple, overlaid ancestral recombination events.
 
 ### Tree sequences and ARGs
 
-Much of the literature on ancestral inference concentrates on the Ancestral Recombination
-Graph, or ARG, in which details of the position and potentially the timing of
-recombination events are explicitly stored. Although a tree sequence *can* represent such
-an ARG, by incorporating nodes that represent recombination events (see the
-{ref}`sec_args` tutorial), this is not normally done for two reasons:
+::::{margin}
+:::{note}
+There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in a local tree.
+:::
+::::
+
+The term "Ancestral Recombination Graph", or ARG, is commonly used to describe a genetic
+genealogy. In particular, many (but not all) authors use it to mean a genetic
+genealogy in which details of the position and potentially the timing of all
+recombination and common ancestor events are explicitly stored. For clarity
+we refer to this sort of genetic genealogy as a "full ARG". Succinct tree sequences can
+represent many different sorts of ARGs, including "full ARGs", by incorporating extra
+non-coalescent nodes (see the {ref}`sec_args` tutorial). However, tree sequences are
+often shown and stored in {ref}`fully simplified<sec_simplification>` form,
+which omits these extra nodes. This is for two main reasons:
 
 1. Many recombination events are undetectable from sequence data, and even if they are
    detectable, they can be logically impossible to place in the genealogy (as in the
    second SPR example above).
-2. The number of recombination events in the genealogy can grow to dominate the total
-   number of nodes in the total tree sequence, without actually contributing to the
-   realised sequences in the samples. In other words, recombination nodes are redundant
-   to the storing of genome data.
+2. The number of recombination and non-coalescing common ancestor events in the genealogy
+   quickly grows to dominate the total number of nodes in the tree sequence,
+   without actually contributing to the mutations inherited by the samples.
+   In other words, these nodes are redundant to the storing of genome data.
 
-Therefore, compared to an ARG, you can think of a standard tree sequence as simply
+Therefore, compared to a full ARG, you can think of a simplified tree sequence as
 storing the trees *created by* recombination events, rather than attempting to record the
 recombination events themselves. The actual recombination events can sometimes be
 inferred from these trees but, as we have seen, it's not always possible. Here's another
 way to put it:
 
 > "an ARG encodes the events that occurred in the history of a sample,
-> whereas a tree sequence encodes the outcome of those events"
+> whereas a [simplified] tree sequence encodes the outcome of those events"
 > ([Kelleher _et al._, 2019](https://doi.org/10.1534/genetics.120.303253))
 
 
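
The margin note added in this hunk separates common ancestry from coalescence. The sketch below tries to make that distinction concrete in plain Python, under stated assumptions: `TOY_EDGES` and `classify_nodes` are hypothetical names, the edge list is an invented toy graph rather than msprime output, and "common ancestor" is approximated as "has two or more distinct children somewhere in the graph".

```python
from collections import defaultdict

# Hypothetical toy ARG on the genome [0, 10); each edge is (left, right, parent, child).
# Nodes 0 and 1 are samples. This encoding is an illustration, not msprime output.
TOY_EDGES = [
    (0, 10, 2, 0),
    (0, 5, 3, 2),
    (5, 10, 4, 2),
    (0, 5, 5, 3),
    (5, 10, 5, 4),
    (0, 10, 6, 5),
    (0, 10, 6, 1),
]


def classify_nodes(edges):
    """Group node IDs by their role in the graph and in the local trees."""
    children = defaultdict(set)
    parents = defaultdict(set)
    for _, _, parent, child in edges:
        children[parent].add(child)
        parents[child].add(parent)

    # Between consecutive breakpoints the local tree is constant, so count
    # how many children each parent has within every such interval.
    breakpoints = sorted({x for left, right, _, _ in edges for x in (left, right)})
    coalescent = set()
    for start, end in zip(breakpoints, breakpoints[1:]):
        child_count = defaultdict(int)
        for left, right, parent, _ in edges:
            if left <= start and end <= right:
                child_count[parent] += 1
        coalescent.update(p for p, n in child_count.items() if n >= 2)

    return {
        "common_ancestor": {p for p, c in children.items() if len(c) >= 2},
        "coalescent": coalescent,
        "recombinant": {c for c, p in parents.items() if len(p) >= 2},
    }


kinds = classify_nodes(TOY_EDGES)
non_coalescent = kinds["common_ancestor"] - kinds["coalescent"]
print("common ancestor nodes:", sorted(kinds["common_ancestor"]))        # [5, 6]
print("coalescent in some local tree:", sorted(kinds["coalescent"]))     # [6]
print("common ancestor but never coalescent:", sorted(non_coalescent))   # [5]
print("recombinant nodes:", sorted(kinds["recombinant"]))                # [2]
```

In this toy graph node 5 joins two lineages that carry disjoint intervals, so it has a single child in every local tree. Nodes of that kind, together with the recombinant node 2, are the "extra non-coalescent nodes" that a fully simplified tree sequence omits.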

what_is.md

Lines changed: 2 additions & 2 deletions
@@ -307,8 +307,8 @@ plt.show()
 ::::{margin}
 :::{note}
 The genetic genealogy is sometimes referred to as an ancestral recombination graph,
-or ARG, and there are {ref}`close similarities<sec_concepts_args>` between ARGs
-and tree sequences (see the {ref}`ARG tutorial<sec_args>`)
+or ARG, and one way to think of a tskit tree sequence is as a way
+to store various different sorts of ARGs (see the {ref}`ARG tutorial<sec_args>`)
 :::
 ::::
 
