@@ -661,34 +661,133 @@ \subsection{Model formulation}\label{sec-methods-model}
661
661
662
662
We assume the dynamical gene expression is determined by the RNA
663
663
splicing process, and infer the unspliced and spliced gene expression
664
- level from the differential equations proposed in velocyto
665
- \citep {La_Manno2018 -lj } and scVelo \citep {Bergen2020 -pj } \begin {align }
666
- \frac {d u\left (\tau ^{\left (k_{cg}\right )}\right )}{d \tau ^{\left (k_{cg}\right )}}
667
- &= \alpha ^{\left (k_{cg}\right )}-\beta _g u\left (\tau ^{\left (k_{cg}\right )}\right ),
668
- \label {eq-dudt }\\
669
- \frac {d s\left (\tau ^{\left (k_{cg}\right )}\right )}{d \tau ^{\left (k_{cg}\right )}}
670
- &= \beta _g u\left (\tau ^{\left (k_{cg}\right )}\right )
671
- -\gamma _g s\left (\tau ^{\left (k_{cg}\right )}\right ). \label {eq-dsdt }
664
+ level from the ordinary differential equation (ODEs) proposed in
665
+ velocyto \citep {La_Manno2018 -lj } and scVelo \citep {Bergen2020 -pj }
666
+ \begin {align }
667
+ \frac {du}{dt} &= \alpha (t) - \beta u, \quad u(0) = u_0 \label {eq-dudt }\\
668
+ \frac {ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label {eq-dsdt }
669
+ \end {align } where \( u(t), s(t)\) are the unspliced and spliced
670
+ expression levels of a gene at time \( t\) under a transcription rate
671
+ \( \alpha (t)\) with possible temporal dependence, splicing rate
672
+ \( \beta \) , and degradation rate \( \gamma \) . We specify this model to a
673
+ setting that depends on cell \( c\) and gene \( g\) as follows:
674
+ \begin {align }
675
+ \frac {du_{cg}}{dt} &= \alpha _{cg}(t) - \beta _{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label {eq-dudt }\\
676
+ \frac {ds_{cg}}{dt} &= \beta _{g} u_{cg} - \gamma _{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label {eq-dsdt }.
672
677
\end {align } In the equation, the subscript \( c\) is the cell dimension,
673
- \( g\) is the gene dimension,
674
- \( \left ( u\left ( \tau ^{(k_{cg})} \right ), s\left ( \tau ^{(k_{cg})} \right ) \right )\)
675
- are the unspliced and spliced expression functions given the change of
676
- time per cell and gene. \( \tau _{cg}\) represents the displacement of
677
- time per cell and gene with \begin {align }
678
- \tau ^{(k_{cg})} &= \operatorname {softplus} \left ( t_{c} - {t_{0}^{(k_{cg})}}_g \right ) \\
679
- & = \log ( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)),
680
- \end {align } in which \( t_c\) is the shared time per cell,
681
- \( {t_{0}^{(kcg)}}_g\) is the gene-specific switching time. Each cell and
682
- gene combination has its transcriptional state
678
+ \( g\) is the gene dimension, \( \left ( u_{cg}(t), s_{cg}(t) \right )\) are
679
+ the unspliced and spliced expression functions given the change of time
680
+ per cell and gene. We restrict attention to piecewise-constant
681
+ \( \alpha _{cg}(t)\) to capture gene-specific activation and repression.
682
+ We take special care to model a gene- and cell-specific switching time
683
+ that marks a single transition from activation to repression by
684
+ introducing a Bernoulli variable \( k_{cg}\) to model unknown activation
685
+ state. We assume our cell-by-gene data-matrix arrive as observations of
686
+ Poisson-counts related to the solution of the above ODEs at unknown
687
+ times \( \tau _{cg}\) , which is modeled as a relationship between an
688
+ unknown latent time shared across each cell, \( t_c\) , and unknown
689
+ gene-specific time-offsets \( t_{0,g}\) where all read counts for a
690
+ single cell occurred at an unknown, but shared latent time \( t_c\) .
691
+ These relative times are also used to parametrize the Bernoulli process
692
+ for \( k_{cg}\) . Importantly, we recognize that the initial conditions
693
+ are in fact unknown.
694
+
695
+ We propose and study two models: Model 1 assumes that spliced and
696
+ unspliced concentrations are both 0 at time 0; Model 2 considers these
697
+ initial conditions as unknowns with a log-Normal prior distribution. In
698
+ general, the solution space of ODEs becomes much richer when considered
699
+ over a domain of initial conditions (as opposed to a single point);
700
+ indeed, this affords Model 2 much greater expressivity. For clarity, we
701
+ first present the generative framework for both models, then provide
702
+ further interpretation and intuition.
703
+
704
+ First, we introduce the generative model that describes the various
705
+ unobserved times: \begin {align }
706
+ % unit lognormal t_c
707
+ t_c &\sim \text {LogNormal}(0, 1) \\
708
+ % gene-specific t_0
709
+ t^{(0)}_{0,g} &\sim \text {LogNormal}(0, 1) \\
710
+ % switching time
711
+ \Delta \textrm {switching}_g &\sim \text {LogNormal}(0, 1) \\
712
+ % gene-specific t_1
713
+ t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm {switching}_g \\
714
+ % cell-gene-specific activation state
715
+ k_{cg} &\sim \text {Bernoulli}(\textrm {logits}=t_c - t^{(1)}_{0,g}) \\
716
+ % cell-gene-specific latent time
717
+ \tau _{cg} &= \text {softplus}(t_c - t^{(k_{cg})}_{0,g}).
718
+ \end {align } Here, \( \tau _{cg}\) represents the displacement of time per
719
+ cell and gene with \begin {align }
720
+ \text {softplus}(t) := \log ( 1 + e^t).
721
+ \end {align } Recall that \( t_c\) is the shared time per cell,
722
+ \( t^{(k_{cg})}_{0,g}\) is the gene-specific switching time. Each cell
723
+ and gene combination has its transcriptional state
683
724
\( k_{cg} \in \{ 0, 1 \} \) , where \( 0\) indicates the activation state
684
725
and \( 1\) indicates the expression state. Each gene has two switching
685
- times for representing activation and repression: \( {t_{0} ^{(0)}}_g \) is
726
+ times for representing activation and repression: \( t ^{(0)}_{0,g} \) is
686
727
the first switching time corresponding to when the gene expression
687
- starts to be activated, \( {t_0^{(1)}}_g\) is the second switching time
688
- corresponding to when the gene expression starts to be repressed. We
689
- note that \( \alpha ^{(1)}\) is shared for all the genes, while
690
- \( {\alpha ^{(0)}}_g\) is learned independently for each gene.
728
+ starts to be activated, \( t^{(1)}_{0,g}\) is the second switching time
729
+ corresponding to when the gene expression starts to be repressed, and is
730
+ determined by the first switching time and the gene-specific switching
731
+ time \( \Delta \text {switching}_g\) . The cell-gene-specific activation
732
+ state \( k_{cg}\) is a Bernoulli random variable with logits equal to the
733
+ difference between the cell's shared time \( t_c\) and the time
734
+ \( t^{(1)}_{0,g}\) when the gene expression starts to be repressed.
735
+
736
+ Next we introduce the priors for the splicing parameters (where the
737
+ activation rate \( \alpha \) depends on the activation state \( k_{cg}\)
738
+ from above): \begin {align }
739
+ \alpha ^{(0)}_g &\sim \text {LogNormal}(0, 1) \\
740
+ \beta _g &\sim \text {LogNormal}(0, 1) \\
741
+ \gamma _g &\sim \text {LogNormal}(0, 1) \\
742
+ \alpha _{cg} &= \begin {cases }
743
+ \alpha ^{(0)}_g & \text {if } k_{cg} = 0 \\
744
+ 0 & \text {if } k_{cg} = 1
745
+ \end {cases }
746
+ \end {align }
747
+ \textbf {Note that $ \alpha ^{(1)}$ is shared for all the genes, while $ {\alpha ^{(0)}}_g$ is learned independently for
748
+ each gene. MATT: this was in the old text, but I think $ \alpha ^{(1)}$ is no longer used based on conversations with Alvin? }
749
+
750
+ Now, we describe the priors for the initial conditions, noting that this
751
+ is the only difference between Model 1 and Model 2: \begin {align }
752
+ \hat {u}^{(0)}_{cg}, \hat {s}^{(0)}_{cg} &\sim \begin {cases }
753
+ (0, 0) & \text {Model 1} \\
754
+ (\text {LogNormal}(0, 1), \text {LogNormal}(0, 1)) & \text {Model 2}
755
+ \end {cases } \\
756
+ u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin {cases }
757
+ \hat {u}^{(0)}_{cg}, \hat {s}^{(0)}_{cg} & \text {if } k_{cg} = 0 \\
758
+ \textrm {ODESolve}\Big ( \hat {u}^{(0)}_{cg}, \hat {s}^{(0)}_{cg}, \alpha ^{(0)}_g, \beta _g, \gamma _g; \ T_0=0, T_1=\Delta \textrm {switching}_g \Big ) & \text {if } k_{cg} = 1
759
+ \end {cases }
760
+ \end {align }
691
761
762
+ We define the ODE solution at time \( \tau _{cg}\) as: \begin {equation }
763
+ \hat {u}_{cg}, \hat {s}_{cg} = \text {ODESolve}\Big ( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha _{cg}, \beta _g, \gamma _g; \ T_0=0, T_1=\tau _{cg} \Big ).
764
+ \end {equation }
765
+
766
+ Next, we define the observation model that gives rise to the observed
767
+ counts as: \begin {align }
768
+ \mu ^{(u)}_c &= \sum _{g=1}^G {u}^{\text {(obs)}}_{cg}, \quad \mu ^{(s)}_c = \sum _{g=1}^G {s}^{\text {(obs)}}_{cg} \\
769
+ \sigma ^{(u)}_c &= \sqrt {\frac {1}{G} \sum _{g=1}^G \left ( u_{cg}^{\text {(obs)}} - \mu ^{(u)}_c \right )^2} \\
770
+ \sigma ^{(s)}_c &= \sqrt {\frac {1}{G} \sum _{g=1}^G \left ( s_{cg}^{\text {(obs)}} - \mu ^{(s)}_c \right )^2} \\
771
+ \eta ^{(u)}_c &\sim \text {Normal}\Big (\mu ^{(u)}_c, \ \sigma ^{(u)}_c\Big ) \\
772
+ \eta ^{(s)}_c &\sim \text {Normal}\Big (\mu ^{(s)}_c, \ \sigma ^{(s)}_c\Big ) \\
773
+ \hat {\mu }^{(u)}_c &= \sum _{g=1}^G \hat {u}_{cg}, \quad \hat {\mu }^{(s)}_c = \sum _{g=1}^G \hat {s}_{cg} \\
774
+ \lambda ^{(u)}_{cg} &= \log (\hat {u}_{cg}) + \log (\eta ^{(u)}_{c}) - \log (\hat {\mu }^{(u)}_c) \\
775
+ \lambda ^{(s)}_{cg} &= \log (\hat {s}_{cg}) + \log (\eta ^{(s)}_{c}) - \log (\hat {\mu }^{(s)}_c) \\
776
+ \hat {u}^{\text {(obs)}}_{cg} &\sim \text {Poisson}\Big (\exp (\lambda ^{(u)}_{cg})\Big ) \\
777
+ \hat {s}^{\text {(obs)}}_{cg} &\sim \text {Poisson}\Big (\exp (\lambda ^{(s)}_{cg})\Big )
778
+ \end {align } Here, we use
779
+ \( {u}^{\text {(obs)}}_{cg}, {s}^{\text {(obs)}}_{cg}\) to denote the
780
+ observed unspliced and spliced counts for cell \( c\) and gene \( g\) . We
781
+ use \( \hat {u}^{\text {(obs)}}_{cg}, \hat {s}^{\text {(obs)}}_{cg}\) to
782
+ denote our generative model's prediction of these unspliced and spliced
783
+ expression levels. The generative process for modeling these observed
784
+ read counts given denoised gene transcript expression level
785
+ \( \hat {u}_{cg}, \hat {s}_{cg}\) considers the expected number of observed
786
+ reads for a given gene in a given cell as the number of transcripts
787
+ times the ratio of the cell's total reads to total transcripts.
788
+ \textbf {Improve descriptions of how noise is modeled in the observation model. }
789
+
790
+ \textbf {Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix. }
692
791
The analytic solution of the differential equations to predict spliced
693
792
and unspliced gene expression given their parameters is derived by the
694
793
authors of scVelo and a theoretical RNA velocity study
@@ -753,89 +852,6 @@ \subsection{Model formulation}\label{sec-methods-model}
753
852
+\beta _g u_0^{(1)}{ }_g \tau ^{(1)} e^{-\beta _g \tau ^{(1)}}.
754
853
\end {align }
755
854
756
- We use these solutions to formulate an end-to-end probabilistic
757
- generative model that relates prior distributions on kinetic parameters
758
- to a distribution on pairs of observed unspliced and spliced read count
759
- matrices
760
-
761
- \begin {align }
762
- \alpha ^{(0)}{ }_g &\sim \operatorname {LogNormal}(0,1), \\
763
- \beta _g &\sim \operatorname {LogNormal}(0,1), \\
764
- \gamma _g &\sim \operatorname {LogNormal}(0,1), \\
765
- &\hskip -18pt \Delta \text { switching }_g \sim \operatorname {LogNormal}(0,1), \\
766
- t_0^{\left (k_{c g}\right )} &= \left \{
767
- \begin {array }{l}
768
- t_0^{(0)}{ }_g \sim \operatorname {Normal}(0,1), k_{c g}=0 \\
769
- t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\
770
- \quad k_{c g}=1
771
- \end {array }\right . \\
772
- t_c &\sim \operatorname {LogNormal}(0,1), \\
773
- k_{c g} &\sim \text {Bernoulli} \left ( \text {logits}= t_c-t_0^{(1)} \right ), \\
774
- \tau ^{\left (k_{c g}\right )}
775
- &= \operatorname {softplus}\left (t_c-t_0^{\left (k_{c g}\right )}{ }_g\right ), \\
776
- u_{c g}
777
- &= \text { Measurement }_u \left ( u\left (\tau ^{\left (k_{c g}\right )}\right ) ;
778
- u_{c g}^{obs}\right ), \\
779
- s_{c g}
780
- &= \text { Measurement }_s \left ( s\left (\tau ^{(k_{c g})}\right ) ;
781
- s_{c g}^{obs}\right ).
782
- \end {align } \( u\left (\tau ^{\left (k_{c g}\right )}\right )\) and
783
- \( s\left (\tau ^{(k_{c g})}\right )\) are are called the denoised gene
784
- expression calculated from the velocity analytic solution input with the
785
- kinetics random variables. \( u_{cg}\) and \( s_{cg}\) are the spliced and
786
- unspliced read count sampled from the Poisson models. \( u_{cg}^{obs}\)
787
- and \( s_{cg}^{obs}\) are the observed spliced and unspliced read count
788
- tables. The generative process
789
-
790
- \( \text {Measurement}(\cdot )\) for observed unspliced read counts given
791
- denoised unspliced gene transcript expression level
792
- \( u\left (\tau ^{(k_{cg})}\right )\) (and identical for observed spliced
793
- read counts) models the expected number of observed reads for a given
794
- gene in a given cell as the number of transcripts times the ratio of the
795
- cell's total reads to total transcripts \begin {align }
796
- u_c^{\hat {obs}} &= \sum _g u_{c g}^{obs}, \\
797
- \hat {u}_c &= \sum _g u\left ( \tau ^{(k_{c g})}\right ), \\
798
- \eta _c^{(u)} &\sim \operatorname {Normal}\left (
799
- u_c^{\hat {obs}_c},
800
- \operatorname {std} \left (u_c^{\hat {obs}}\right )
801
- \right ), \\
802
- \mu _{c g}^{(u)} &= \log \left (u\left (\tau ^k{ }_{c g}\right )\right )
803
- +\log \left (\eta _c^{(u)}\right )-\log \left (\hat {u}_c\right ), \\
804
- u_{c g}^{obs} &\sim
805
- \operatorname {Poisson}\left (\lambda =\exp \left (\mu _{c g}^{(u)}\right )\right ).
806
- \end {align }
807
-
808
- For the first Pyro-Velocity model (Model 1), we constrain the shared
809
- time to be strictly larger than \( t_{0}^{(0)}\) by introducing auxiliary
810
- random variables \[
811
- \text {t\_ constraint}_{cg}
812
- \sim \text {Bernoulli} \left ( \text {logits} = t_c - {t_{0}^{(0)}}_g \right ),
813
- \] and setting their values to \( 1\) , and we set the initial condition
814
- per gene to be \begin {align }
815
- \left ( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right ) &= \left \{
816
- \begin {array }{l}
817
- (0,0), k_{c g}=0 \\
818
- \bigg ( {u \left ( \Delta \text { switching }_g \right )}_g,\\
819
- \quad {s \left ( \Delta \text { switching }_g \right )}_g \bigg ), \\
820
- \quad k_{c g}=1
821
- \end {array }\right .
822
- \end {align } For the extended Pyro -Velocity model (Model 2), we remove
823
- the shared time constraint \( \text {t\_ constraint}_{cg}\) , thus allowing
824
- a time lag per gene that might be caused by delayed gene activation and
825
- set the initial condition per gene as random variables that are strictly
826
- positive \( \left ({u_{0}^{(0)}}_g,
827
- {s_{0}^{(0)}}_g\right )\) , which allow genes having a basal expression
828
- level before gene activation. Then, we compute the gene expression at
829
- the second switching time as \begin {align }
830
- ({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &=
831
- \bigg ( {u \left ( \Delta \text { switching }_g \right )}_g, \nonumber \\
832
- & \qquad {s \left ( \Delta \text { switching }_g \right )}_g \bigg ),
833
- \end {align } which shares the same initial condition
834
- \( \left ({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right )\) where \begin {align }
835
- {u_{0}^{(0)}}_g &\sim \operatorname {LogNormal}(0,1),\\
836
- {s_{0}^{(0)}}_g &\sim \operatorname {LogNormal}(0,1).
837
- \end {align }
838
-
839
855
\subsection {Variational inference }\label {sec-methods-inference }
840
856
841
857
Given observations
0 commit comments