This is the code repository for the paper "HypKG: Hypergraph-based Knowledge Graph Contextualization for Precision Healthcare". It provides code to preprocess data and generate KG embeddings, run baseline experiments, and test contextualization methods that improve KG-based insights.
Knowledge Graphs (KGs) that store general factual information often lack the ability to account for important contextual details, such as the status of specific patients, which are crucial for precision healthcare. In this paper, we propose HypKG, a framework that integrates patient information, such as data from Electronic Health Records (EHRs), into KGs to generate contextualized knowledge representations for precision healthcare.
A toy example of KG contextualization. Left: a traditional KG. Right: our proposed contextualized KG.
HypKG is a framework designed to contextualize KG representation with patient-specific context for precision healthcare, and the pipeline is described below. First, HypKG connects KG entities with relevant context information from EHR by linking medical entities between them. Then, HypKG jointly represents KG knowledge and contextual information from EHR in a hypergraph structure, capturing key relationships between patients, features, and other patients. Finally, the node and hyperedge embeddings in the hypergraph structure are learned and optimized for downstream precision healthcare tasks.
Overview of our proposed HypKG framework.
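For intuition, the joint patient-knowledge hypergraph described above can be pictured as an incidence matrix in which each patient forms a hyperedge over the KG-entity and EHR-feature nodes appearing in their record. Below is a minimal, illustrative sketch; the node names and data layout are hypothetical and are not the repo's actual data structures:

```python
import torch

# Illustrative node vocabulary: linked KG entities and EHR features
# share a single node space in the joint hypergraph (hypothetical names).
nodes = ["diabetes_kg", "metformin_kg", "hba1c_high_ehr", "age_65_plus_ehr"]
node_index = {name: i for i, name in enumerate(nodes)}

# Each patient is a hyperedge over the nodes that appear in their record.
patients = {
    "patient_a": ["diabetes_kg", "metformin_kg", "hba1c_high_ehr"],
    "patient_b": ["diabetes_kg", "age_65_plus_ehr"],
}

# Binary incidence matrix H: rows = nodes, columns = patient hyperedges.
H = torch.zeros(len(nodes), len(patients))
for j, members in enumerate(patients.values()):
    for name in members:
        H[node_index[name], j] = 1.0

print(H)  # (4 nodes x 2 patient hyperedges)
```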
- `KGEmbedding`: Contains scripts for generating KG embeddings.
  - `gen_embedding`: Embedding generation scripts (e.g., `TransE.py`, `complEx.py`, `compGCN.py`).
  - `preprocess`: Preprocessing scripts for data preparation.
  - `read_results`: Scripts to read and format embedding results.
- `Baselines`: Contains code and data for baseline experiments.
  - `data`: Datasets used in the baseline experiments.
  - `src`: Source code for dataset conversion, preprocessing, and model training.
  - `scripts`: Shell scripts to facilitate experiments (e.g., `mimic.sh`).
- `Contextualization`: Contains data and source code for HypKG's contextualization experiments.
  - `data`: Data used specifically in the contextualization experiments.
  - `src`: Code to preprocess data, define models, and execute training.
  - `scripts`: Scripts for running specific contextualization tasks (e.g., `mimic_run.sh`, `promote.sh`).
- `Docs`: Documentation files with detailed project explanations and usage guides.
We use a large-scale public knowledge graph, iBKH (from https://github.com/wcm-wanglab/iBKH), as the primary KG dataset. For patient context information, we contextualize the knowledge graph by integrating patient-specific data from two EHR datasets: MIMIC-III (from https://physionet.org/content/mimiciii/1.4/) and PROMOTE (a private dataset). Due to the sensitive nature of medical data and privacy considerations, there are restrictions on sharing these data. To gain access to the two patient-specific datasets, appropriate training and credentials may be required (https://physionet.org/). For further assistance with data access or other related inquiries, please feel free to reach out to our author team.
- Navigate to Embedding Scripts:
  - Go to `KGEmbedding/gen_embedding` to find implementation scripts for the embedding models:
    - TransE: `TransE.py`
    - ComplEx: `complEx.py`
    - CompGCN: `compGCN.py`
  - Each script contains methods to configure and train models on knowledge graph data (a minimal training sketch follows this list).
- Data Preprocessing:
  - Preprocess data for embedding by running the scripts in `KGEmbedding/preprocess` in the intended order. For example, start with `0_ConvertToHRT.py` to format datasets correctly.
- Save and Analyze Embeddings:
  - Use the utilities in `KGEmbedding/read_results` to save and analyze the generated embeddings.
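As a concrete illustration of the steps above, here is a minimal sketch of training a TransE model with PyKEEN (which this repo acknowledges) on triples stored in head/relation/tail (HRT) TSV format. The file path and hyperparameters are placeholders, and the actual scripts in `KGEmbedding/gen_embedding` may differ:

```python
from pykeen.pipeline import pipeline
from pykeen.triples import TriplesFactory

# Triples in tab-separated head<TAB>relation<TAB>tail format,
# e.g., as produced by a preprocessing step like 0_ConvertToHRT.py.
tf = TriplesFactory.from_path("kg_triples.tsv")  # placeholder path
training, testing = tf.split([0.9, 0.1], random_state=42)

result = pipeline(
    training=training,
    testing=testing,
    model="TransE",                         # or "ComplEx", etc.
    model_kwargs=dict(embedding_dim=128),   # placeholder hyperparameters
    training_kwargs=dict(num_epochs=100),
)
result.save_to_directory("transe_output")

# Extract the learned entity embeddings for downstream use.
entity_emb = result.model.entity_representations[0](indices=None).detach()
print(entity_emb.shape)
```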
- Code:
  - Please see the code in the PromptLink repo: https://github.com/constantjxyz/PromptLink (an illustrative linking sketch follows).
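PromptLink performs the actual entity linking; for intuition only, here is a hedged sketch of the common SapBERT-style approach (SapBERT is acknowledged below) of embedding entity names and matching them by nearest neighbor. The model name is the public HuggingFace checkpoint; the pooling and matching choices here are assumptions, not PromptLink's exact method:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Public SapBERT checkpoint; PromptLink's actual pipeline may differ.
name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

def embed(names):
    batch = tok(names, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] pooling, as in the SapBERT repo

kg_names = ["type 2 diabetes mellitus", "metformin"]  # illustrative KG entities
ehr_names = ["T2DM", "metformin 500mg tab"]           # illustrative EHR terms

kg_vecs, ehr_vecs = embed(kg_names), embed(ehr_names)
sims = torch.nn.functional.normalize(ehr_vecs) @ torch.nn.functional.normalize(kg_vecs).T
match = sims.argmax(dim=1)  # nearest KG entity for each EHR term
for term, idx in zip(ehr_names, match):
    print(term, "->", kg_names[idx])
```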
- Organize Data in the Required Format:
  - Input data for the contextualization model should be organized in specific formats. Please refer to the `Contextualization/data/raw_data` folder as a reference. Note that the raw data in this folder is demonstration data intended to illustrate the required format and is not suitable for training.
- Define and Train Models:
  - In the `Contextualization/src` folder, find scripts for defining contextualized models, running training, and managing additional preprocessing steps.
- Run Training:
  - Run `Contextualization/src/train.py` to initiate model training.
  - Customize parameters for layers, model architecture, and dataset paths within `Contextualization/src/models.py`.
  - Use the scripts in `Contextualization/src/scripts/` (e.g., `mimic_run.sh`) for additional model runs adapted to specific contexts such as MIMIC data (a simplified model-layer sketch follows this list).
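For orientation, here is a minimal sketch of the kind of two-stage (node → hyperedge → node) message passing used by AllSet-style hypergraph networks, which this repo builds on. The layer below is a simplified assumption, not the exact model defined in `Contextualization/src/models.py`:

```python
import torch
import torch.nn as nn

class TwoStageHypergraphLayer(nn.Module):
    """Simplified node -> hyperedge -> node message passing over an
    incidence matrix H (num_nodes x num_hyperedges)."""
    def __init__(self, dim):
        super().__init__()
        self.to_edge = nn.Linear(dim, dim)
        self.to_node = nn.Linear(dim, dim)

    def forward(self, x, H):
        deg_e = H.sum(dim=0).clamp(min=1).unsqueeze(1)  # hyperedge sizes
        deg_v = H.sum(dim=1).clamp(min=1).unsqueeze(1)  # node degrees
        e = torch.relu(self.to_edge((H.t() @ x) / deg_e))  # mean-aggregate nodes into hyperedges
        x = torch.relu(self.to_node((H @ e) / deg_v))      # mean-aggregate hyperedges back to nodes
        return x, e  # updated node and hyperedge (patient) embeddings

# Toy usage: 4 nodes, 2 patient hyperedges, 16-dim features.
H = torch.tensor([[1., 1.], [1., 0.], [1., 0.], [0., 1.]])
x = torch.randn(4, 16)
layer = TwoStageHypergraphLayer(16)
x_new, patient_emb = layer(x, H)
print(x_new.shape, patient_emb.shape)  # torch.Size([4, 16]) torch.Size([2, 16])
```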
- Organize Data in the Required Format:
  - Input data for the baseline models should be organized in specific formats. Please refer to the `Baselines/data/raw_data` folder as a reference. Note that the raw data in this folder is demonstration data intended to illustrate the required format and is not suitable for training.
- Train Baseline Models:
  - Execute `Baselines/src/train.py` to train the baseline models.
  - Configure model settings and dataset paths as needed within the scripts to fit your specific dataset and model requirements.
  - Use the scripts in `Baselines/src/scripts/` (e.g., `mimic.sh`) for additional model runs adapted to specific contexts such as MIMIC data (a generic training-loop sketch follows this list).
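As a hedged illustration of what a training script like this typically does, here is a minimal loop for a binary patient-outcome classifier. The model, data shapes, and hyperparameters are placeholders and do not reflect the actual baselines in `Baselines/src`:

```python
import torch
import torch.nn as nn

# Placeholder data: 128 patients with 32-dim feature vectors and binary labels.
x = torch.randn(128, 32)
y = torch.randint(0, 2, (128,)).float()

# Placeholder baseline: a small MLP; the real baselines may be graph models.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(20):
    opt.zero_grad()
    logits = model(x).squeeze(1)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()
print("final training loss:", loss.item())
```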
Each component in the usage pipeline can be adapted to specific datasets or knowledge graphs by editing configuration settings within each folder.
- Python 3.11.5
- Required libraries are listed in `requirements.txt`.
We would like to thank the authors from AllSet (https://github.com/jianhao2016/AllSet), PromptLink (https://github.com/constantjxyz/PromptLink), Pykeen (https://github.com/pykeen/pykeen), iBKH (https://github.com/wcm-wanglab/iBKH) and SAPBERT (https://github.com/cambridgeltl/sapbert) for their open-source efforts.