Skip to content

Commit b855306

Browse files
chore(manuscript): update v2 diff file
1 parent 1d6b1e2 commit b855306

File tree

1 file changed

+122
-106
lines changed
  • reproducibility/manuscript

1 file changed

+122
-106
lines changed

reproducibility/manuscript/v2.tex

Lines changed: 122 additions & 106 deletions
Original file line numberDiff line numberDiff line change
@@ -661,34 +661,133 @@ \subsection{Model formulation}\label{sec-methods-model}
661661
662662
We assume the dynamical gene expression is determined by the RNA
663663
splicing process, and infer the unspliced and spliced gene expression
664-
level from the differential equations proposed in velocyto
665-
\citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj} \begin{align}
666-
\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}}
667-
&= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right),
668-
\label{eq-dudt}\\
669-
\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}}
670-
&= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right)
671-
-\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt}
664+
level from the ordinary differential equation (ODEs) proposed in
665+
velocyto \citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj}
666+
\begin{align}
667+
\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\
668+
\frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt}
669+
\end{align} where \(u(t), s(t)\) are the unspliced and spliced
670+
expression levels of a gene at time \(t\) under a transcription rate
671+
\(\alpha(t)\) with possible temporal dependence, splicing rate
672+
\(\beta\), and degradation rate \(\gamma\). We specify this model to a
673+
setting that depends on cell \(c\) and gene \(g\) as follows:
674+
\begin{align}
675+
\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\
676+
\frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}.
672677
\end{align} In the equation, the subscript \(c\) is the cell dimension,
673-
\(g\) is the gene dimension,
674-
\(\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)\)
675-
are the unspliced and spliced expression functions given the change of
676-
time per cell and gene. \(\tau_{cg}\) represents the displacement of
677-
time per cell and gene with \begin{align}
678-
\tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\
679-
& = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)),
680-
\end{align} in which \(t_c\) is the shared time per cell,
681-
\({t_{0}^{(kcg)}}_g\) is the gene-specific switching time. Each cell and
682-
gene combination has its transcriptional state
678+
\(g\) is the gene dimension, \(\left( u_{cg}(t), s_{cg}(t) \right)\) are
679+
the unspliced and spliced expression functions given the change of time
680+
per cell and gene. We restrict attention to piecewise-constant
681+
\(\alpha_{cg}(t)\) to capture gene-specific activation and repression.
682+
We take special care to model a gene- and cell-specific switching time
683+
that marks a single transition from activation to repression by
684+
introducing a Bernoulli variable \(k_{cg}\) to model unknown activation
685+
state. We assume our cell-by-gene data-matrix arrive as observations of
686+
Poisson-counts related to the solution of the above ODEs at unknown
687+
times \(\tau_{cg}\), which is modeled as a relationship between an
688+
unknown latent time shared across each cell, \(t_c\), and unknown
689+
gene-specific time-offsets \(t_{0,g}\) where all read counts for a
690+
single cell occurred at an unknown, but shared latent time \(t_c\).
691+
These relative times are also used to parametrize the Bernoulli process
692+
for \(k_{cg}\). Importantly, we recognize that the initial conditions
693+
are in fact unknown.
694+
695+
We propose and study two models: Model 1 assumes that spliced and
696+
unspliced concentrations are both 0 at time 0; Model 2 considers these
697+
initial conditions as unknowns with a log-Normal prior distribution. In
698+
general, the solution space of ODEs becomes much richer when considered
699+
over a domain of initial conditions (as opposed to a single point);
700+
indeed, this affords Model 2 much greater expressivity. For clarity, we
701+
first present the generative framework for both models, then provide
702+
further interpretation and intuition.
703+
704+
First, we introduce the generative model that describes the various
705+
unobserved times: \begin{align}
706+
% unit lognormal t_c
707+
t_c &\sim \text{LogNormal}(0, 1) \\
708+
% gene-specific t_0
709+
t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\
710+
% switching time
711+
\Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\
712+
% gene-specific t_1
713+
t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\
714+
%cell-gene-specific activation state
715+
k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\
716+
% cell-gene-specific latent time
717+
\tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}).
718+
\end{align} Here, \(\tau_{cg}\) represents the displacement of time per
719+
cell and gene with \begin{align}
720+
\text{softplus}(t) := \log( 1 + e^t).
721+
\end{align} Recall that \(t_c\) is the shared time per cell,
722+
\(t^{(k_{cg})}_{0,g}\) is the gene-specific switching time. Each cell
723+
and gene combination has its transcriptional state
683724
\(k_{cg} \in \{ 0, 1 \}\), where \(0\) indicates the activation state
684725
and \(1\) indicates the expression state. Each gene has two switching
685-
times for representing activation and repression: \({t_{0}^{(0)}}_g\) is
726+
times for representing activation and repression: \(t^{(0)}_{0,g}\) is
686727
the first switching time corresponding to when the gene expression
687-
starts to be activated, \({t_0^{(1)}}_g\) is the second switching time
688-
corresponding to when the gene expression starts to be repressed. We
689-
note that \(\alpha^{(1)}\) is shared for all the genes, while
690-
\({\alpha^{(0)}}_g\) is learned independently for each gene.
728+
starts to be activated, \(t^{(1)}_{0,g}\) is the second switching time
729+
corresponding to when the gene expression starts to be repressed, and is
730+
determined by the first switching time and the gene-specific switching
731+
time \(\Delta \text{switching}_g\). The cell-gene-specific activation
732+
state \(k_{cg}\) is a Bernoulli random variable with logits equal to the
733+
difference between the cell's shared time \(t_c\) and the time
734+
\(t^{(1)}_{0,g}\) when the gene expression starts to be repressed.
735+
736+
Next we introduce the priors for the splicing parameters (where the
737+
activation rate \(\alpha\) depends on the activation state \(k_{cg}\)
738+
from above): \begin{align}
739+
\alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\
740+
\beta_g &\sim \text{LogNormal}(0, 1) \\
741+
\gamma_g &\sim \text{LogNormal}(0, 1) \\
742+
\alpha_{cg} &= \begin{cases}
743+
\alpha^{(0)}_g & \text{if } k_{cg} = 0 \\
744+
0 & \text{if } k_{cg} = 1
745+
\end{cases}
746+
\end{align}
747+
\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for
748+
each gene. MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?}
749+
750+
Now, we describe the priors for the initial conditions, noting that this
751+
is the only difference between Model 1 and Model 2: \begin{align}
752+
\hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases}
753+
(0, 0) & \text{Model 1} \\
754+
(\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2}
755+
\end{cases} \\
756+
u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases}
757+
\hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\
758+
\textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1
759+
\end{cases}
760+
\end{align}
691761
762+
We define the ODE solution at time \(\tau_{cg}\) as: \begin{equation}
763+
\hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big).
764+
\end{equation}
765+
766+
Next, we define the observation model that gives rise to the observed
767+
counts as: \begin{align}
768+
\mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\
769+
\sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\
770+
\sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\
771+
\eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\
772+
\eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\
773+
\hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\
774+
\lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\
775+
\lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\
776+
\hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\
777+
\hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big)
778+
\end{align} Here, we use
779+
\({u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}\) to denote the
780+
observed unspliced and spliced counts for cell \(c\) and gene \(g\). We
781+
use \(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}\) to
782+
denote our generative model's prediction of these unspliced and spliced
783+
expression levels. The generative process for modeling these observed
784+
read counts given denoised gene transcript expression level
785+
\(\hat{u}_{cg}, \hat{s}_{cg}\) considers the expected number of observed
786+
reads for a given gene in a given cell as the number of transcripts
787+
times the ratio of the cell's total reads to total transcripts.
788+
\textbf{Improve descriptions of how noise is modeled in the observation model.}
789+
790+
\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.}
692791
The analytic solution of the differential equations to predict spliced
693792
and unspliced gene expression given their parameters is derived by the
694793
authors of scVelo and a theoretical RNA velocity study
@@ -753,89 +852,6 @@ \subsection{Model formulation}\label{sec-methods-model}
753852
+\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}.
754853
\end{align}
755854
756-
We use these solutions to formulate an end-to-end probabilistic
757-
generative model that relates prior distributions on kinetic parameters
758-
to a distribution on pairs of observed unspliced and spliced read count
759-
matrices
760-
761-
\begin{align}
762-
\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\
763-
\beta_g &\sim \operatorname{LogNormal}(0,1), \\
764-
\gamma_g &\sim \operatorname{LogNormal}(0,1), \\
765-
&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\
766-
t_0^{\left(k_{c g}\right)} &= \left\{
767-
\begin{array}{l}
768-
t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\
769-
t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\
770-
\quad k_{c g}=1
771-
\end{array}\right. \\
772-
t_c &\sim \operatorname{LogNormal}(0,1), \\
773-
k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\
774-
\tau^{\left(k_{c g}\right)}
775-
&= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\
776-
u_{c g}
777-
&= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ;
778-
u_{c g}^{obs}\right), \\
779-
s_{c g}
780-
&= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ;
781-
s_{c g}^{obs}\right).
782-
\end{align} \(u\left(\tau^{\left(k_{c g}\right)}\right)\) and
783-
\(s\left(\tau^{(k_{c g})}\right)\) are are called the denoised gene
784-
expression calculated from the velocity analytic solution input with the
785-
kinetics random variables. \(u_{cg}\) and \(s_{cg}\) are the spliced and
786-
unspliced read count sampled from the Poisson models. \(u_{cg}^{obs}\)
787-
and \(s_{cg}^{obs}\) are the observed spliced and unspliced read count
788-
tables. The generative process
789-
790-
\(\text{Measurement}(\cdot)\) for observed unspliced read counts given
791-
denoised unspliced gene transcript expression level
792-
\(u\left(\tau^{(k_{cg})}\right)\) (and identical for observed spliced
793-
read counts) models the expected number of observed reads for a given
794-
gene in a given cell as the number of transcripts times the ratio of the
795-
cell's total reads to total transcripts \begin{align}
796-
u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\
797-
\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\
798-
\eta_c^{(u)} &\sim \operatorname{Normal}\left(
799-
u_c^{\hat{obs}_c},
800-
\operatorname{std} \left(u_c^{\hat{obs}}\right)
801-
\right), \\
802-
\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right)
803-
+\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\
804-
u_{c g}^{obs} &\sim
805-
\operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right).
806-
\end{align}
807-
808-
For the first Pyro-Velocity model (Model 1), we constrain the shared
809-
time to be strictly larger than \(t_{0}^{(0)}\) by introducing auxiliary
810-
random variables \[
811-
\text{t\_constraint}_{cg}
812-
\sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right),
813-
\] and setting their values to \(1\), and we set the initial condition
814-
per gene to be \begin{align}
815-
\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{
816-
\begin{array}{l}
817-
(0,0), k_{c g}=0 \\
818-
\bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\
819-
\quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\
820-
\quad k_{c g}=1
821-
\end{array}\right.
822-
\end{align} For the extended Pyro -Velocity model (Model 2), we remove
823-
the shared time constraint \(\text{t\_constraint}_{cg}\), thus allowing
824-
a time lag per gene that might be caused by delayed gene activation and
825-
set the initial condition per gene as random variables that are strictly
826-
positive \(\left({u_{0}^{(0)}}_g,
827-
{s_{0}^{(0)}}_g\right)\), which allow genes having a basal expression
828-
level before gene activation. Then, we compute the gene expression at
829-
the second switching time as \begin{align}
830-
({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &=
831-
\bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\
832-
& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg),
833-
\end{align} which shares the same initial condition
834-
\(\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)\) where \begin{align}
835-
{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\
836-
{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1).
837-
\end{align}
838-
839855
\subsection{Variational inference}\label{sec-methods-inference}
840856
841857
Given observations

0 commit comments

Comments
 (0)