Merge pull request #424 from nspope/docs-cite-singer

hyanwong · web-flow · commit 1956970eeaca · 2024-07-27T05:37:22.000+01:00
Make sure to cite SINGER in docs for rescaling idea
diff --git a/docs/methods.md b/docs/methods.md
@@ -68,7 +68,7 @@ Pros
 : Old nodes do not suffer from time-discretisation issues caused by forcing
     bounds on the oldest times
 : Iterative updating properly accounts for cycles in the genealogy
-: No need to specify prior times
+: No need to specify node-specific priors; a mixture "prior" (fit by expectation-maximization) is used to regularise the roots.
 : Can account for variable population sizes using rescaling
 
 Cons
@@ -118,15 +118,23 @@ ts = tsdate.date(input_ts, mutation_rate=1e-8, progress=True)
 (sec_rescaling)=
 #### Rescaling
 
-During each EP step, the `variational_gamma` method implements a further process
-that we call *rescaling*, and which can help to deal with the effects of variable population
-size though time. Basically, time is broken up into a number of intervals, and times within
-intervals are simultaneously scaled such that the expected density of mutations along each
-path from a sample to the root best matches the mutational density predicted from the
-user-provided mutation rate. The number of intervals can be specified using the
-`rescaling_intervals` parameter. If set to 0, no rescaling is performed; this means that
-dates may be inaccurately estimated if the dataset comes from a set of samples with a complex
-demographic history.
+During each EP step, the `variational_gamma` method implements a further
+process called *rescaling*, which can help to deal with the effects of
+variable population size through time. This is based on an algorithm introduced
+by the ARG inference software
+[SINGER](https://doi.org/10.1101/2024.03.16.585351) (Deng et al. 2024) that
+rescales node ages by matching observed and expected segregating sites within
+time windows.
+Basically, time is broken up into a number of intervals, and times within
+intervals are simultaneously scaled such that the expected density of mutations
+along each path from a sample to the root best matches the mutational density
+predicted from the user-provided mutation rate. The number of intervals can be
+specified using the `rescaling_intervals` parameter. If set to 0, no rescaling
+is performed; this means that dates may be inaccurately estimated if the
+dataset comes from a set of samples with a complex demographic history.
+`tsdate` uses a modified version of Deng et al.'s algorithm that works on gamma
+natural parameters rather than point estimates, and that is not biased by the
+artefactual polytomies introduced by `tsinfer` for the sake of compression.
 
 TODO: describe the rescaling step in more detail. Could also link to [the population size docs](sec_popsize)
 
@@ -174,4 +182,4 @@ have no mapped mutations (e.g. in the centromere), which can be removed by
 
 The `maximization` approach is slightly less accurate empirically,
 and will not return true posteriors, but is theoretically robust and
-additionally is always numerically stable.
+additionally is always numerically stable.
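
The rescaling paragraph added in this diff describes matching observed and expected mutation counts within time windows. As a rough illustration of that idea only, here is a minimal toy sketch that works on point estimates of node times. It is not the `tsdate` implementation, which the docs say operates on gamma natural parameters, and it is not the SINGER code. The function names `window_scalings` and `rescale_time`, the edge tuple layout, and the assumption that mutations fall uniformly along each branch are all hypothetical choices made for this example.

```python
def window_scalings(edges, breaks, mutation_rate):
    """Return one scale factor per time window: observed / expected mutations.

    edges: list of (child_time, parent_time, span, num_mutations) tuples
    breaks: increasing window boundaries that cover all node times
    """
    num_windows = len(breaks) - 1
    observed = [0.0] * num_windows
    expected = [0.0] * num_windows
    for child, parent, span, muts in edges:
        length = parent - child
        for i in range(num_windows):
            # Portion of this branch that lies inside window i
            overlap = min(parent, breaks[i + 1]) - max(child, breaks[i])
            if overlap <= 0:
                continue
            # Toy assumption: mutations are spread uniformly along the branch
            observed[i] += muts * overlap / length
            expected[i] += mutation_rate * span * overlap
    # Windows with no branch area are left unscaled
    return [o / e if e > 0 else 1.0 for o, e in zip(observed, expected)]


def rescale_time(t, breaks, scales):
    """Map an old time to a new time using a piecewise-linear change of clock."""
    new_t = 0.0
    for i, scale in enumerate(scales):
        lo, hi = breaks[i], breaks[i + 1]
        if t <= lo:
            break
        # Stretch or shrink the part of [0, t] that falls in window i
        new_t += scale * (min(t, hi) - lo)
    return new_t


# Two sample branches carrying more mutations than expected, and one older
# branch carrying fewer, so recent times stretch and old times shrink.
edges = [
    (0.0, 10.0, 100.0, 20),
    (0.0, 10.0, 100.0, 20),
    (10.0, 20.0, 100.0, 5),
]
breaks = [0.0, 10.0, 20.0]
mutation_rate = 0.01

scales = window_scalings(edges, breaks, mutation_rate)
new_times = [rescale_time(t, breaks, scales) for t in (10.0, 20.0)]
print(scales, new_times)
```

In this sketch, each window's factor is chosen so that, after rescaling, the expected number of mutations in the window (mutation rate times branch area) equals the number observed there. Using more windows lets the clock bend more freely, which corresponds loosely to raising `rescaling_intervals`.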