
Commit 114b1b2

Merge pull request #1 from Mye-InfoBank/graph
Switch to graph-based symbol resolution
2 parents 0b7280c + d5ca466 commit 114b1b2

40 files changed, +2929 -503 lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ build/
 dist/
 wheels/
 *.egg-info
-data
+./data

 # Virtual environments
 .venv

.vscode/settings.json

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+{
+    "python.testing.pytestArgs": [
+        "tests"
+    ],
+    "python.testing.unittestEnabled": false,
+    "python.testing.pytestEnabled": true
+}

README.md

Lines changed: 109 additions & 29 deletions
@@ -1,6 +1,6 @@
 # hugo-unifier

-This python package can unify gene symbols based on the [HUGO database](https://www.genenames.org/tools/multi-symbol-checker/).
+This Python package can unify gene symbols across datasets based on the [HUGO database](https://www.genenames.org/tools/multi-symbol-checker/).

 ## Installation

@@ -13,54 +13,134 @@ pip install hugo-unifier
 ## Usage

 The package can be used both as a command line tool and as a library.
+It operates in a two-step process:
+
+1. Take the symbols from the input data and create a list of operations to unify them, including a reason for the change
+2. Apply the operations to the input data

 ### Command Line Tool

-Currently, the command line tool only supports unifying the entries of a column in an AnnData objects `var` attribute. The input file and column name must be passed as an argument. The tool will update the column in place and save the AnnData object to a new file.
+```bash
+hugo-unifier get --outdir . test1.h5ad test2.h5ad
+```
+
+This will create two files, `test1_changes.csv` and `test2_changes.csv`, in the current directory.
+These files can be manually inspected to see what changes will be made and what the reasons for each change are.
+
+The command line tool can also be used to apply the changes to the input data:

-Check the help message for more information:
 ```bash
-hugo-unifier --help
+hugo-unifier apply --input test1.h5ad --changes test1_changes.csv --output test1_unified.h5ad
+hugo-unifier apply --input test2.h5ad --changes test2_changes.csv --output test2_unified.h5ad
 ```

 ### Library
-The package can be used as a library to unify gene symbols in a pandas DataFrame. The `unify` function takes a list of gene symbols and returns a list of unified gene symbols. The function can be used as follows:
+
+Similar to the command line tool, the library can be used to get the changes and apply them to the input data.

 ```python
-from hugo_unifier import unify
-gene_symbols = ["TP53", "BRCA1", "EGFR"]
-unified_symbols = unify(gene_symbols)
-print(unified_symbols)
+from hugo_unifier import get_changes, apply_changes
+import anndata as ad
+
+adata_test1 = ad.read_h5ad("test1.h5ad")
+adata_test2 = ad.read_h5ad("test2.h5ad")
+
+dataset_symbols = {
+    "test1": adata_test1.var.index.tolist(),
+    "test2": adata_test2.var.index.tolist(),
+}
+
+# Get the changes
+G, sample_changes = get_changes(dataset_symbols)
+
+changes_test1 = sample_changes["test1"]
+changes_test2 = sample_changes["test2"]
+
+# Apply the changes
+adata_test1_unified = apply_changes(adata_test1, changes_test1)
+adata_test2_unified = apply_changes(adata_test2, changes_test2)
 ```

 ## How it works

-Different datasets sometimes use different gene symbols for the same gene. Sometimes, the same gene symbol occurs
-with slight modifications, such as dashes, underscores, or other characters. The `hugo-unifier` iteratively applies attempts to manipulate the gene symbols and check them against the HUGO database.
+### Step 1: Get HUGO data for symbols while applying manipulations
+
+The first step is to get the HUGO data for the symbols in the input data.
+However, sometimes symbols contain artifacts like dots instead of dashes, or numbers following dots indicating a version. As these are mostly not detected in the HUGO database, we try to manipulate the symbols until the HUGO database returns a result.
+The manipulations are done in the following order:
+
+1. Keep the symbol as-is
+2. Replace dots with dashes
+3. Remove everything after the first dot
+
+If one of the manipulations returns a result for a given symbol, we do not try the others for that symbol. Notably, we start with the most conservative approach, keeping the symbol as-is, and only try the other manipulations if that fails.
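The ordered fallback described in Step 1 can be sketched as follows. This is a minimal illustration, not the package's implementation: the manipulation names are taken from the previous README version shown in the removed lines, while `first_hit` and the `known` set (a stand-in for a HUGO lookup) are hypothetical.

```python
from typing import Callable, Optional

# Manipulations in the documented order, most conservative first.
MANIPULATIONS: list[tuple[str, Callable[[str], str]]] = [
    ("identity", lambda s: s),
    ("dot-to-dash", lambda s: s.replace(".", "-")),
    ("discard-after-dot", lambda s: s.split(".", 1)[0]),
]


def first_hit(symbol: str, known: set[str]) -> Optional[tuple[str, str]]:
    """Return (manipulation name, manipulated symbol) for the first
    manipulation whose result is found in `known`, else None.

    `known` is a hypothetical stand-in for querying the HUGO database.
    """
    for name, manipulate in MANIPULATIONS:
        candidate = manipulate(symbol)
        if candidate in known:
            # Stop at the first success; later manipulations are not tried.
            return name, candidate
    return None


# Made-up lookup table for illustration only.
known_symbols = {"TP53", "HLA-A", "BRCA1"}
print(first_hit("TP53", known_symbols))     # ('identity', 'TP53')
print(first_hit("HLA.A", known_symbols))    # ('dot-to-dash', 'HLA-A')
print(first_hit("BRCA1.2", known_symbols))  # ('discard-after-dot', 'BRCA1')
print(first_hit("FOO.1", known_symbols))    # None
```

Keeping the manipulations in an ordered list makes the "most conservative first" rule explicit and lets further manipulations be appended without touching the lookup loop.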
+
+### Step 2: Build a symbol graph
+
+Different symbols can sometimes have quite complex relationships.
+For example, a symbol can be an alias or a previous symbol for multiple other symbols, or a symbol can have multiple aliases or previous symbols. These relationships can be nicely visualized in a graph.
+
+An example of this is shown here:
+
+![Graph example](docs/example.png)
+
+Green nodes are approved symbols, blue ones are not.
+
+The graph is constructed as follows:
+1. Add a node for each of the following:
+   - Original symbols from the input data
+   - Manipulated symbols that arise within the process
+   - Symbols returned by the HUGO database
+2. For each node, record the datasets that contain a symbol with exactly that name
+3. Draw edges for the following relationships:
+   - Manipulations (e.g. dot to dash)
+   - HUGO relations (Alias, Previous symbol, Approved symbol)
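The three construction steps can be sketched with plain dictionaries and sets. The package itself depends on `networkx` according to `pyproject.toml`; this sketch avoids it only to stay dependency-free. All symbols, dataset names, and the `(source, relation, target)` triple format are made up for illustration and are not real HUGO results.

```python
# Hypothetical inputs: symbols per dataset, plus lookup results expressed as
# (source symbol, relation, target symbol) triples.
dataset_symbols = {
    "ds1": ["OLD1", "ABC.1"],
    "ds2": ["NEW1", "ABC-1"],
}
relations = [
    ("ABC.1", "dot-to-dash", "ABC-1"),     # manipulation edge
    ("ABC-1", "approved", "ABC-1"),        # self-loop, dropped later in cleaning
    ("OLD1", "previous-symbol", "NEW1"),   # HUGO relation edge
    ("NEW1", "approved", "NEW1"),
]

nodes: dict[str, set[str]] = {}        # symbol -> datasets containing exactly it
edges: set[tuple[str, str, str]] = set()  # (source, target, relation)

# Steps 1 and 2: one node per original symbol, tagged with its datasets.
for dataset, symbols in dataset_symbols.items():
    for symbol in symbols:
        nodes.setdefault(symbol, set()).add(dataset)

# Steps 1 and 3: nodes for manipulated/returned symbols, plus the edges.
for source, relation, target in relations:
    nodes.setdefault(source, set())
    nodes.setdefault(target, set())
    edges.add((source, target, relation))

for symbol in sorted(nodes):
    print(symbol, sorted(nodes[symbol]))
```

A node that only arises from a manipulation or a lookup result ends up with an empty dataset set, which is what later steps use to tell represented from unrepresented symbols.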
+
+#### Clean the graph
+
+This includes only two steps:
+1. Remove self-loops (edges from a node to itself)
+2. Remove all nodes that meet the following conditions (and are thus irrelevant for the unification):
+   - Node has exactly one incoming edge, that originates from an approved symbol
+   - Node is an approved symbol which is not represented in the input data
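A possible reading of the two cleaning steps is sketched below. It assumes that a node is removed only when both listed conditions hold at once; the text does not state this explicitly, so treat it as an interpretation. The function name `clean_graph` and the data layout (a dict of dataset sets, a set of edge pairs, a set of approved symbols) are hypothetical.

```python
def clean_graph(nodes, edges, approved):
    """Return cleaned copies of `nodes` and `edges`.

    nodes:    dict mapping symbol -> set of datasets containing it
    edges:    set of (source, target) pairs
    approved: set of approved symbols
    """
    # Step 1: drop self-loops.
    edges = {(s, t) for s, t in edges if s != t}

    # Step 2: drop nodes that meet both conditions (assumed to be an AND).
    removable = set()
    for node, datasets in nodes.items():
        incoming = [s for s, t in edges if t == node]
        single_approved_parent = len(incoming) == 1 and incoming[0] in approved
        unused_approved = node in approved and not datasets
        if single_approved_parent and unused_approved:
            removable.add(node)

    nodes = {n: d for n, d in nodes.items() if n not in removable}
    edges = {(s, t) for s, t in edges if s not in removable and t not in removable}
    return nodes, edges


# Made-up example: B is approved, unused, and only reachable from approved A.
example_nodes = {"A": {"ds1"}, "B": set(), "C": {"ds2"}}
example_edges = {("A", "A"), ("A", "B"), ("C", "A")}
cleaned_nodes, cleaned_edges = clean_graph(example_nodes, example_edges, {"A", "B"})
print(sorted(cleaned_nodes))  # ['A', 'C']
print(sorted(cleaned_edges))  # [('C', 'A')]
```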
+
+### Step 3: Find unification opportunities
+
+Currently, there are two approaches implemented. This can be easily extended in the future.
+
+#### Resolve unapproved symbols
+
+Iterate over all nodes in the graph that represent unapproved symbols and try to find an optimal solution for them. The optimal solution is decided as follows:
+
+1. If the node has only one outgoing edge, the optimal solution is the target of that edge
+2. If the node has multiple outgoing edges, we check if the targets of the edges are represented in any datasets. If there is exactly one target that is represented in any datasets, we use that one. If there are multiple, we mark it as a _conflict_ and do not resolve it. If there is none, we do not resolve it either.
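The target selection rule can be written as a small decision function. This is a sketch under assumptions: `choose_target`, the status strings, and the data layout are hypothetical, and a node without any outgoing edge (a case the text does not cover) is simply treated as unresolved.

```python
from typing import Optional


def choose_target(targets: list[str], nodes: dict[str, set[str]]) -> tuple[Optional[str], str]:
    """Pick the target for an unapproved symbol.

    targets: symbols reachable over the node's outgoing edges
    nodes:   dict mapping symbol -> set of datasets containing it
    Returns (target or None, status).
    """
    if not targets:
        # Not described in the text; assumed to stay unresolved.
        return None, "unresolved"
    if len(targets) == 1:
        return targets[0], "single-edge"

    # Several candidates: keep only those present in at least one dataset.
    represented = [t for t in targets if nodes.get(t)]
    if len(represented) == 1:
        return represented[0], "single-represented"
    if len(represented) > 1:
        return None, "conflict"
    return None, "unresolved"


# Made-up data for illustration.
nodes = {"X": {"ds1"}, "Y": set(), "Z": {"ds2"}}
print(choose_target(["Y"], nodes))            # ('Y', 'single-edge')
print(choose_target(["X", "Y"], nodes))       # ('X', 'single-represented')
print(choose_target(["X", "Z"], nodes))       # (None, 'conflict')
print(choose_target(["Y", "missing"], nodes)) # (None, 'unresolved')
```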
+
+Now we have a source and a target node. Based on this, we can check if there is any dataset that has both the symbols in the source and target node. If that is the case, we would potentially lose some information if we eliminated the source node.
+Thus, we do the following:
+- If an overlap exists (like the "Devlin" dataset in the following example), copy the symbols that are exclusive to the source node to the target node ![Copy previous symbols](docs/previous-copy.png)
+- If no overlap exists, we can safely remove the source node and rename all symbols from the source node to the target node ![Rename alias symbols](docs/dot-to-dash.png)
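The copy-versus-rename decision comes down to a set intersection over the datasets of the two nodes. The sketch below assumes a hypothetical `plan_resolution` helper and a hypothetical `(action, source, target, dataset)` tuple format; the dataset names other than "Devlin" are invented.

```python
def plan_resolution(source: str, target: str, nodes: dict[str, set[str]]) -> list[tuple[str, str, str, str]]:
    """Return planned operations as (action, source, target, dataset) tuples."""
    overlap = nodes[source] & nodes[target]
    if overlap:
        # Some dataset holds both symbols, so the source must be kept there.
        # Datasets that only have the source get a copy under the target name.
        exclusive = nodes[source] - nodes[target]
        return [("copy", source, target, ds) for ds in sorted(exclusive)]
    # No dataset holds both: the source can be renamed everywhere it occurs.
    return [("rename", source, target, ds) for ds in sorted(nodes[source])]


# Overlap case: "Devlin" contains both symbols, "dsB" only the source.
with_overlap = {"OLD1": {"Devlin", "dsB"}, "NEW1": {"Devlin", "dsC"}}
print(plan_resolution("OLD1", "NEW1", with_overlap))
# [('copy', 'OLD1', 'NEW1', 'dsB')]

# No overlap: every dataset with the source symbol gets a rename.
without_overlap = {"ABC.1": {"dsA", "dsB"}, "ABC-1": {"dsC"}}
print(plan_resolution("ABC.1", "ABC-1", without_overlap))
# [('rename', 'ABC.1', 'ABC-1', 'dsA'), ('rename', 'ABC.1', 'ABC-1', 'dsB')]
```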
+
+#### Aggregate approved symbols

-The following manipulations are applied in the following order:
-1. `identity`: Use the gene symbol as is.
-2. `dot-to-dash`: Replace dots with dashes.
-3. `discard-after-dot`: Discard everything after the first dot.
+This tries to resolve situations where one group of datasets contains one approved symbol, while another group of datasets contains another approved symbol, while one is an alias of the other. The logic is as follows:

-More conservative manipulations are applied first. The first manipulation that returns a valid gene symbol is used.
+1. Iterate all nodes representing approved symbols
+2. Get all predecessors of the node
+3. Get the union of the represented datasets of all predecessors and the node itself
+4. Get the maximum number of datasets that are represented by any single predecessor or the node itself
+5. Calculate the improvement ratio as the union size divided by the maximum size
+6. If the improvement ratio is at least 1.5, copy the symbols from all predecessors to the node

-### Resolution of aliases
+In the example below, the STRA13 gene would be copied to CENPX for all samples that have CENPX but not STRA13. This is because the union is 9 and the largest number of datasets in a single one of the two nodes is 6 in CENPX. The improvement ratio is exactly 1.5, so the copy is done.
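The improvement ratio from steps 3 to 5 can be reproduced with a few set operations. The sketch makes two assumptions: the threshold comparison is inclusive, since the worked example treats a ratio of exactly 1.5 as sufficient, and STRA13 occurs in three datasets that do not contain CENPX, which is one distribution consistent with the stated union of 9. Dataset names and function names are hypothetical.

```python
THRESHOLD = 1.5  # value from the text; inclusive comparison is an assumption


def improvement_ratio(node: str, predecessors: list[str], nodes: dict[str, set[str]]) -> float:
    """Union size of all involved dataset sets divided by the largest single set."""
    groups = [nodes[node]] + [nodes[p] for p in predecessors]
    union_size = len(set().union(*groups))
    max_size = max(len(group) for group in groups)
    if max_size == 0:
        return 0.0  # no dataset contains any of the symbols
    return union_size / max_size


def should_aggregate(node: str, predecessors: list[str], nodes: dict[str, set[str]]) -> bool:
    return improvement_ratio(node, predecessors, nodes) >= THRESHOLD


# Hypothetical dataset assignment matching the numbers in the example:
# CENPX in 6 datasets, STRA13 in 3 others, union of 9.
nodes = {
    "CENPX": {"d1", "d2", "d3", "d4", "d5", "d6"},
    "STRA13": {"d7", "d8", "d9"},
}
print(improvement_ratio("CENPX", ["STRA13"], nodes))  # 1.5
print(should_aggregate("CENPX", ["STRA13"], nodes))   # True
```

A ratio close to 1 means the node already covers almost every dataset on its own, so aggregating the predecessors would add little.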

-When resolving aliases, the following steps are applied:
+![Aggregation of approved symbols](docs/approved-aggregation.png)

-1. **Remove Conflicting Aliases**:
-   Aliases that conflict with already approved symbols are removed. For example, if an alias maps to a symbol that is already approved, it is discarded to avoid conflicts.
+### Step 4: Provide change dataframe

-2. **Correct Same Aliases**:
-   If an alias maps to the same symbol as its original symbol, it is corrected and marked as an approved symbol. This ensures that aliases that are effectively the same as the original symbol are treated as valid.
+All changes that are made to the graph are also stored in the form of a dataframe that is made available to the user for inspection. Before the dataframe is returned, it is split into smaller per-dataset dataframes.

-3. **Handle Duplicate Aliases**:
-   If multiple aliases map to the same original symbol:
-   - By default, only one alias is retained, and the rest are discarded.
-   - If the `keep_gene_multiple_aliases` option is enabled, all aliases are retained, and an identity mapping is created for the duplicates.
+If `hugo-unifier` is used via CLI, these dataframes are saved to the output directory. If `hugo-unifier` is used via the library, the dataframes are returned as a dictionary with the dataset names as keys and the dataframes as values.

-4. **Unaccepted Aliases**:
-   Any aliases that cannot be resolved or conflict with the above rules are marked as unaccepted and excluded from the final results.
+### Step 5: Apply changes to the input data

-These steps ensure that aliases are resolved in a consistent and conflict-free manner, prioritizing approved symbols and avoiding ambiguity in the mapping process.
+The content of a single-dataset change dataframe is applied to the corresponding input dataset. The change entries are applied one by one to the input dataset, in the same order as they were detected in the graph unification process.
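Steps 4 and 5 can be illustrated with plain Python records instead of dataframes and AnnData objects. Everything here is an assumption made for the sketch: the record keys (`dataset`, `action`, `symbol`, `new`), the two action names, and the helper functions are hypothetical and do not reflect the package's actual change dataframe columns.

```python
from collections import defaultdict

# Hypothetical change records in the order they were detected.
changes = [
    {"dataset": "test1", "action": "rename", "symbol": "ABC.1", "new": "ABC-1"},
    {"dataset": "test2", "action": "copy", "symbol": "OLD1", "new": "NEW1"},
    {"dataset": "test1", "action": "copy", "symbol": "ABC-1", "new": "ABC1"},
]


def split_by_dataset(changes):
    """Group change records per dataset, preserving detection order."""
    per_dataset = defaultdict(list)
    for change in changes:
        per_dataset[change["dataset"]].append(change)
    return dict(per_dataset)


def apply_in_order(symbols, dataset_changes):
    """Apply change records one by one to a list of symbols."""
    result = list(symbols)
    for change in dataset_changes:
        if change["action"] == "rename":
            result = [change["new"] if s == change["symbol"] else s for s in result]
        elif change["action"] == "copy":
            # Add the new symbol alongside the existing one, once.
            if change["symbol"] in result and change["new"] not in result:
                result.append(change["new"])
    return result


per_dataset = split_by_dataset(changes)
print(apply_in_order(["ABC.1", "XYZ"], per_dataset["test1"]))
# ['ABC-1', 'XYZ', 'ABC1']
```

The second `test1` change only has an effect because the rename ran first, which is why the entries need to be applied in detection order.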

docs/alias-rename.png

15.4 KB

docs/approved-aggregation.png

49.7 KB

docs/dot-to-dash.png

20.4 KB

docs/example.png

55.7 KB

docs/previous-copy.png

22.3 KB

pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -1,6 +1,6 @@
 [project]
 name = "hugo-unifier"
-version = "0.1.2"
+version = "0.2.0"
 description = "Add your description here"
 readme = "README.md"
 authors = [
@@ -9,6 +9,7 @@ authors = [
 requires-python = ">=3.12"
 dependencies = [
     "anndata>=0.11.4",
+    "networkx>=3.4.2",
     "requests>=2.32.3",
     "rich-click>=1.8.8",
 ]
@@ -22,6 +23,7 @@ build-backend = "hatchling.build"

 [dependency-groups]
 dev = [
+    "ipykernel>=6.29.5",
     "pytest>=8.3.5",
     "ruff>=0.11.4",
     "scanpy>=1.11.1",

src/hugo_unifier/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1,3 +1,4 @@
-from hugo_unifier.unify import unify
+from hugo_unifier.get_changes import get_changes
+from hugo_unifier.apply_changes import apply_changes

-__all__ = ["unify"]
+__all__ = ["get_changes", "apply_changes"]
