- $\hat{X}_p$: The observed gene expression measurements in cells with perturbation $p$.
-
- $\mathcal{D}_{\text{basal}}$: The distribution of the unperturbed, baseline cell population.
-
- $\hat{T}_p(\mathcal{D}_{\text{basal}})$: True effect caused by perturbation $p$ on the population.
-
- $H(\mathcal{D}_{\text{basal}})$: Biological heterogeneity of the baseline population.
-
- $\varepsilon$: Experiment-specific technical noise, assumed independent of the unperturbed cell state and $\mathcal{D}_{\text{basal}}$.
+
- \\(\hat{X}_p\\): The observed gene expression measurements in cells with perturbation \\(p\\).
+
- \\(\mathcal{D}_{\text{basal}}\\): The distribution of the unperturbed, baseline cell population.
+
- \\(\hat{T}_p(\mathcal{D}_{\text{basal}})\\): True effect caused by perturbation \\(p\\) on the population.
+
- \\(H(\mathcal{D}_{\text{basal}})\\): Biological heterogeneity of the baseline population.
+
- \\(\varepsilon\\): Experiment-specific technical noise, assumed independent of the unperturbed cell state and \\(\mathcal{D}_{\text{basal}}\\).
# STATE: The baseline from Arc
@@ -108,7 +108,7 @@ A gene consists of _exons_ (protein coding sections) and _introns_ (non-protein
With this basic understanding, we can move on to how the SE model works. Remember, our core goal for SE is to create **meaningful
cell embeddings**. To do this, we must first create meaningful gene embeddings.
-
To produce a single gene embedding, we first obtain the amino acid sequences (e.g. $\texttt{SDKPDMAEI}$... for TMSB4X) of all the different protein isoforms encoded by the gene in question. We then feed these sequences to [ESM2](https://huggingface.co/facebook/esm2_t48_15B_UR50D), a 15B-parameter protein language model from FAIR. ESM2 produces an embedding _per amino acid_, and we mean pool these to obtain a "transcript" (a.k.a. protein isoform) embedding.
+
To produce a single gene embedding, we first obtain the amino acid sequences (e.g. \\(\texttt{SDKPDMAEI}\\)... for TMSB4X) of all the different protein isoforms encoded by the gene in question. We then feed these sequences to [ESM2](https://huggingface.co/facebook/esm2_t48_15B_UR50D), a 15B-parameter protein language model from FAIR. ESM2 produces an embedding _per amino acid_, and we mean pool these to obtain a "transcript" (a.k.a. protein isoform) embedding.
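
As a rough illustration of the pooling, here is a minimal pure-Python sketch. The function names `mean_pool` and `gene_embedding` are hypothetical, and the small nested lists merely stand in for real per-amino-acid ESM2 outputs. The same averaging is applied a second time, across isoforms, to arrive at a gene-level vector.

```python
from statistics import fmean


def mean_pool(vectors):
    """Average equal-length vectors element-wise into a single vector."""
    return [fmean(column) for column in zip(*vectors)]


def gene_embedding(isoform_residue_embeddings):
    """Pool per-residue embeddings into one vector per isoform, then pool the isoforms.

    isoform_residue_embeddings: one entry per protein isoform, each a list of
    per-amino-acid vectors (toy stand-ins for ESM2 outputs, an assumption here).
    """
    isoform_embeddings = [mean_pool(residues) for residues in isoform_residue_embeddings]
    return mean_pool(isoform_embeddings)


# Toy example: a 2-residue isoform and a 1-residue isoform, embedding size 2.
isoform_a = [[1.0, 2.0], [3.0, 4.0]]
isoform_b = [[0.0, 0.0]]
toy_gene = gene_embedding([isoform_a, isoform_b])
```

Note that this is a mean of means: each isoform contributes equally to the gene vector regardless of its sequence length.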
Now that we have all of these protein isoform embeddings, we mean pool them to get the gene embedding. Next, we project these gene embeddings to our model dimension using a learned encoder as follows:
We add a $\texttt{[CLS]}$ token and a $\texttt{[DS]}$ token to our sentence. The $\texttt{[CLS]}$ token ends up being used as our "cell embedding" (very BERT-like)
-
and the $\texttt{[DS]}$ token is used to "disentangle dataset-specific effects". Although the genes are sorted by log fold
+
We add a \\(\texttt{[CLS]}\\) token and a \\(\texttt{[DS]}\\) token to our sentence. The \\(\texttt{[CLS]}\\) token ends up being used as our "cell embedding" (very BERT-like)
+
and the \\(\texttt{[DS]}\\) token is used to "disentangle dataset-specific effects". Although the genes are sorted by log fold
expression level, Arc further enforces the magnitude of each gene's expression by incorporating the transcriptome in a
fashion analogous to positional embeddings. Through an odd ["soft binning" algorithm](https://github.com/ArcInstitute/state/blob/main/src/state/emb/nn/model.py#L374) and 2 MLPs, they create some
"expression encodings" which they then add to each gene embedding. This should modulate the magnitude of each gene
@@ -151,7 +151,7 @@ Understanding how your submission will be evaluated is key to success. The 3 eva
Perturbation Discrimination intends to evaluate how well your model can uncover _relative differences_ between
perturbations. To do this, we compute the Manhattan distances for all the measured perturbed transcriptomes in the test set (the ground
-
truth we are trying to predict, $y_t$, and all other perturbed transcriptomes, $y_p^n$) to our predicted transcriptome $\hat{y}_t$. We then rank where the
+
truth we are trying to predict, \\(y_t\\), and all other perturbed transcriptomes, \\(y_p^n\\)) to our predicted transcriptome \\(\hat{y}_t\\). We then rank where the
ground truth lands with respect to all transcriptomes as follows:
$$
@@ -164,7 +164,7 @@ $$
\text{PDisc}_t = \frac{r_t}{T}
$$
-
Where $0$ would be a perfect match. The overall score for your predictions is the mean of all $$\text{PDisc}_t$$. This is then normalized to:
+
Where \\(0\\) would be a perfect match. The overall score for your predictions is the mean of all $$\text{PDisc}_t$$. This is then normalized to:
$$
\text{PDiscNorm} = 1 - 2\text{PDisc}
@@ -174,17 +174,17 @@ We multiply by 2 as for a random prediction, ~half of the results would be close
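
To make the Perturbation Discrimination score concrete, here is a minimal pure-Python sketch. It assumes that \\(r_t\\) counts how many measured transcriptomes sit strictly closer (in Manhattan distance) to the prediction than its own target does, and that \\(T\\) is the number of perturbations. The function names are hypothetical and this is not the official scoring code.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pdisc_scores(predicted, measured):
    """Per-perturbation PDisc_t = r_t / T.

    predicted[t] and measured[t] are the predicted and measured transcriptomes
    for perturbation t, given as equal-length lists of expression values.
    """
    total = len(measured)
    scores = []
    for t, y_hat in enumerate(predicted):
        target_distance = manhattan(y_hat, measured[t])
        # ASSUMPTION: r_t is the number of measured transcriptomes that are
        # strictly closer to the prediction than the true target is.
        rank = sum(manhattan(y_hat, y) < target_distance for y in measured)
        scores.append(rank / total)
    return scores


def pdisc_norm(predicted, measured):
    """Normalized score: 1 - 2 * mean(PDisc_t)."""
    scores = pdisc_scores(predicted, measured)
    return 1 - 2 * (sum(scores) / len(scores))


# Toy data: three perturbations measured over two genes.
measured = [[0, 0], [10, 0], [0, 10]]
perfect = pdisc_norm(measured, measured)
# Swap the first two predictions so each lands on the wrong target.
swapped = pdisc_scores([[10, 0], [0, 0], [0, 10]], measured)
```

In this toy case a perfect prediction ranks first for every perturbation, while each swapped prediction has exactly one measured transcriptome closer to it than its own target.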
## Differential Expression
-
Differential Expression evaluates what fraction of the truly affected genes you correctly identified as significantly affected. First, for each gene, we compute a $p$-value using a [Wilcoxon rank-sum test with tie correction](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test). We do this for both our predicted perturbation distribution and the ground truth perturbation distribution.
+
Differential Expression evaluates what fraction of the truly affected genes you correctly identified as significantly affected. First, for each gene, we compute a \\(p\\)-value using a [Wilcoxon rank-sum test with tie correction](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test). We do this for both our predicted perturbation distribution and the ground truth perturbation distribution.
-
Next, we apply the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure), which adjusts the $p$-values to control the false discovery rate: with $20,000$ genes and a $p$-value threshold of $0.05$, you'd expect around $1,000$ false positives even if no gene were truly affected. We denote our set of predicted differentially expressed genes $G_{p,pred}$, and the ground truth set of differentially expressed genes $G_{p,true}$.
+
Next, we apply the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure), which adjusts the \\(p\\)-values to control the false discovery rate: with \\(20,000\\) genes and a \\(p\\)-value threshold of \\(0.05\\), you'd expect around \\(1,000\\) false positives even if no gene were truly affected. We denote our set of predicted differentially expressed genes \\(G_{p,pred}\\), and the ground truth set of differentially expressed genes \\(G_{p,true}\\).
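
For intuition, the Benjamini-Hochberg adjustment can be sketched in a few lines of plain Python. This is a textbook version written for illustration (the function name `benjamini_hochberg` is hypothetical) and is not necessarily the exact routine the evaluation uses.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the same order as the input."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by n / rank and keeping
    # a running minimum so the adjusted values never decrease with rank.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Without any correction, testing 20,000 genes at a 0.05 threshold yields
# roughly 1,000 false positives when no gene is truly affected.
expected_false_positives = 20_000 * 0.05

# Toy example with four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A gene is then called differentially expressed when its adjusted value falls below the chosen threshold.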
If the size of our set is less than the ground truth set size, take the intersection of the sets, and divide by the true number of differentially expressed genes as follows:
If the size of our set is greater than the ground truth set size, select the subset of genes we predict to be most differentially expressed (our "most confident" predictions, denoted $\tilde{G}_{p,pred}$), take its intersection with the ground truth set, and then divide by the true number.
+
If the size of our set is greater than the ground truth set size, select the subset of genes we predict to be most differentially expressed (our "most confident" predictions, denoted \\(\tilde{G}_{p,pred}\\)), take its intersection with the ground truth set, and then divide by the true number.
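
Both cases can be handled by one small function, sketched below. It assumes the predicted genes arrive already ranked from most to least confident (how that ranking is produced is left open here) and that the "most confident" subset is truncated to the size of the ground truth set. The name `de_overlap` is hypothetical.

```python
def de_overlap(predicted_ranked, true_genes):
    """Fraction of truly differentially expressed genes that we recovered.

    predicted_ranked: predicted DE genes, most confident first.
    true_genes: ground truth DE genes (any iterable of gene names).
    """
    true_set = set(true_genes)
    if not true_set:
        raise ValueError("score is undefined when no gene is truly differentially expressed")
    # ASSUMPTION: when we predict more genes than the ground truth contains,
    # keep only our top len(true_set) predictions. When we predict fewer,
    # this slice leaves the prediction list unchanged.
    kept = predicted_ranked[: len(true_set)]
    return len(true_set.intersection(kept)) / len(true_set)


# Fewer predictions than true genes: one of three true genes recovered.
small = de_overlap(["A", "B"], {"A", "C", "D"})
# More predictions than true genes: only the top two predictions are kept.
large = de_overlap(["A", "X", "B", "C"], {"A", "B"})
```

In the second call, gene `B` is a correct prediction but falls outside the top two, so it does not count toward the score.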