Commit 1605491

chore: fix inline math expr
1 parent d857fe2 commit 1605491

2 files changed: +24 -14 lines changed

_blog.yml

Lines changed: 10 additions & 0 deletions
@@ -6383,3 +6383,13 @@
 - research
 - evaluation
 - ai
+
+
+- local: virtual-cell-challenge
+title: "Arc Virtual Cell Challenge: A Primer"
+thumbnail: /blog/assets/virtual-cell-challenge/thumbnail.png
+author: FL33TW00D-HF
+date: July 18, 2025
+tags:
+- collaboration
+- guide

virtual-cell-challenge.md

Lines changed: 14 additions & 14 deletions
@@ -1,6 +1,6 @@
 ---
 title: "Arc Virtual Cell Challenge: A Primer"
-thumbnail: /blog/assets/arc-virtual-cell-challenge/thumbnail.png
+thumbnail: /blog/assets/virtual-cell-challenge/thumbnail.png
 authors:
 - user: FL33TW00D-HF
 ---
@@ -55,11 +55,11 @@ $$
 $$
 
 where:
-- $\hat{X}_p$: The observed gene expression measurements in cells with perturbation $p$
-- $\mathcal{D}_{\text{basal}}$: The distribution of the unperturbed, baseline cell population.
-- $\hat{T}_p(\mathcal{D}_{\text{basal}})$: True effect caused by perturbation $p$ on the population.
-- $H(\mathcal{D}_{\text{basal}})$: Biological heterogeneity of the baseline population.
-- $\varepsilon$: Experiment-specific technical noise, assumed independent of the unperturbed cell state and $\mathcal{D}_{\text{basal}}$.
+- \\(\hat{X}_p\\): The observed gene expression measurements in cells with perturbation \\(p\\)
+- \\(\mathcal{D}_{\text{basal}}\\): The distribution of the unperturbed, baseline cell population.
+- \\(\hat{T}_p(\mathcal{D}_{\text{basal}})\\): True effect caused by perturbation \\(p\\) on the population.
+- \\(H(\mathcal{D}_{\text{basal}})\\): Biological heterogeneity of the baseline population.
+- \\(\varepsilon\\): Experiment-specific technical noise, assumed independent of the unperturbed cell state and \\(\mathcal{D}_{\text{basal}}\\).
 
 # STATE: The baseline from Arc
 
@@ -108,7 +108,7 @@ A gene consists of _exons_ (protein coding sections) and _introns_ (non-protein
 With this basic understanding, we can move on to how the SE model works. Remember, our core goal for SE is to create **meaningful
 cell embeddings**. To do this, we must first create meaningful gene embeddings.
 
-To produce a single gene embedding, we first obtain the amino acid sequence (e.g $\texttt{SDKPDMAEI}$... for TMSB4X) of all the different protein isoforms encoded for by the gene in question. We then feed these sequences to [ESM2](https://huggingface.co/facebook/esm2_t48_15B_UR50D), a 15B parameter Protein Language Model from FAIR. ESM produces an embedding _per amino acid_, and we mean pool them together to obtain a "transcript" (a.k.a protein isoform) embedding.
+To produce a single gene embedding, we first obtain the amino acid sequence (e.g \\(\texttt{SDKPDMAEI}\\)... for TMSB4X) of all the different protein isoforms encoded for by the gene in question. We then feed these sequences to [ESM2](https://huggingface.co/facebook/esm2_t48_15B_UR50D), a 15B parameter Protein Language Model from FAIR. ESM produces an embedding _per amino acid_, and we mean pool them together to obtain a "transcript" (a.k.a protein isoform) embedding.
 
 Now we have all of these protein isoform embeddings, we then just mean pool those to get the gene embedding. Next, we project these gene embeddings to our model dimension using a learned encoder as follows:
 
@@ -124,8 +124,8 @@ $$
 \tilde{\mathbf{c}}^{(i)} = \left[\mathbf{z}_{\text{cls}}, \tilde{\mathbf{g}}_1^{(i)}, \tilde{\mathbf{g}}_2^{(i)}, \ldots, \tilde{\mathbf{g}}_L^{(i)}, \mathbf{z}_{\text{ds}}\right] \in \mathbb{R}^{(L+2) \times h}
 $$
 
-We add a $\texttt{[CLS]}$ token and $\texttt{[DS]}$ token to our sentence. The $\texttt{[CLS]}$ token ends up being used as our "cell embedding" (very BERT-like)
-and the $\texttt{[DS]}$ token is used to "disentangle dataset-specific effects". Although the genes are sorted by log fold
+We add a \\(\texttt{[CLS]}\\) token and \\(\texttt{[DS]}\\) token to our sentence. The \\(\texttt{[CLS]}\\) token ends up being used as our "cell embedding" (very BERT-like)
+and the \\(\texttt{[DS]}\\) token is used to "disentangle dataset-specific effects". Although the genes are sorted by log fold
 expression level, Arc further enforces the magnitude of each genes expression by incorporating the transcriptome in a
 fashion analogous to positional embeddings. Through an odd ["soft binning" algorithm](https://github.com/ArcInstitute/state/blob/main/src/state/emb/nn/model.py#L374) and 2 MLPs, they create some
 "expression encodings" which they then add to each gene embedding. This should modulate the magnitude of each gene
@@ -151,7 +151,7 @@ Understanding how your submission will be evaluated is key to success. The 3 eva
 
 Perturbation Discrimination intends to evaluate how well your model can uncover _relative differences_ between
 perturbations. To do this, we compute the Manhattan distances for all the measured perturbed transcriptomes in the test set (the ground
-truth we are trying to predict, $y_t$ and all other perturbed transcriptomes, $y_p^n$) to our predicted transcriptome $\hat{y}_t$. We then rank where the
+truth we are trying to predict, \\(y_t\\) and all other perturbed transcriptomes, \\(y_p^n\\)) to our predicted transcriptome \\(\hat{y}_t\\). We then rank where the
 ground truth lands with respect to all transcriptomes as follows:
 
 $$
@@ -164,7 +164,7 @@ $$
 \text{PDisc}_t = \frac{r_t}{T}
 $$
 
-Where $0$ would be a perfect match. The overall score for your predictions is the mean of all $$\text{PDisc}_t$$. This is then normalized to:
+Where \\(0\\) would be a perfect match. The overall score for your predictions is the mean of all $$\text{PDisc}_t$$. This is then normalized to:
 
 $$
 \text{PDiscNorm} = 1 - 2\text{PDisc}
@@ -174,17 +174,17 @@ We multiply by 2 as for a random prediction, ~half of the results would be close
 
 ## Differential Expression
 
-Differential Expression intends to evaluate what fraction of the truly affected genes did you correctly identify as significantly affected. Firstly, for each gene compute a $p$-value using a [Wilcoxon rank-sum test with tie correction](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test). We do this for both our predicted perturbation distribution and the ground truth perturbation distribution.
+Differential Expression intends to evaluate what fraction of the truly affected genes did you correctly identify as significantly affected. Firstly, for each gene compute a \\(p\\)-value using a [Wilcoxon rank-sum test with tie correction](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test). We do this for both our predicted perturbation distribution and the ground truth perturbation distribution.
 
-Next, we apply the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure), basically some stats to modulate the $p$-values, as with $20,000$ genes and a $p$-value threshold of $0.05$, you'd expect $1,000$ false positives. We denote our set of predicted differentially expressed genes $G_{p,pred}$, and the ground truth set of differentially expressed genes $G_{p,true}$.
+Next, we apply the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini%E2%80%93Hochberg_procedure), basically some stats to modulate the \\(p\\)-values, as with \\(20,000\\) genes and a \\(p\\)-value threshold of \\(0.05\\), you'd expect \\(1,000\\) false positives. We denote our set of predicted differentially expressed genes \\(G_{p,pred}\\), and the ground truth set of differentially expressed genes \\(G_{p,true}\\).
 
 If the size of our set is less than the ground truth set size, take the intersection of the sets, and divide by the true number of differentially expressed genes as follows:
 
 $$
 DE_p = \frac{G_{p,pred} \cap G_{p,true}}{n_{p,true}}
 $$
 
-If the size of our set is greater than the ground truth set size, select the subset we predict are most differentially expressed (our "most confident" predictions, denoted $\tilde{G}_{p,pred}$), take the intersection with the ground truth set, and then divide by the true number.
+If the size of our set is greater than the ground truth set size, select the subset we predict are most differentially expressed (our "most confident" predictions, denoted \\(\tilde{G}_{p,pred}\\)), take the intersection with the ground truth set, and then divide by the true number.
 
 $$
 DE_p = \frac{\tilde{G}_{p,pred} \cap G_{p,true}}{n_{p,true}}
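
Note on the Differential Expression hunk: the overlap score it describes can be sketched in Python roughly as below. This is a minimal illustration, not the challenge's official scorer. The function names `benjamini_hochberg` and `de_overlap` are hypothetical, the 0.05 cutoff is assumed to apply to the adjusted p-values, "most confident" is assumed to mean smallest adjusted p-value, and the numerator is taken to be the size of the intersection.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def de_overlap(pred_adj, true_adj, alpha=0.05):
    """Fraction of truly differentially expressed genes recovered.

    pred_adj / true_adj: dict mapping gene name -> adjusted p-value.
    Returns None when no gene is truly significant (an assumption here;
    the source does not say how that case is handled).
    """
    true_set = {g for g, p in true_adj.items() if p < alpha}
    if not true_set:
        return None
    # Predicted significant genes, most confident (smallest p) first.
    pred_ranked = sorted(
        (g for g, p in pred_adj.items() if p < alpha),
        key=lambda g: pred_adj[g],
    )
    # If we predicted more genes than are truly DE, keep only the top ones.
    pred_kept = pred_ranked[: len(true_set)]
    return len(true_set.intersection(pred_kept)) / len(true_set)
```

With `true_adj = {"A": 0.01, "B": 0.02, "C": 0.5}` and `pred_adj = {"A": 0.001, "B": 0.2, "C": 0.01, "D": 0.03}`, the three predicted genes are truncated to the two most confident (`A`, `C`), only `A` is truly affected, and the score is 0.5.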
