This project is a healthcare AI model built using Python and scikit-learn to predict patient health risk levels (Low, Moderate, High) based on demographic, socioeconomic, and medical history data.
The goal of this project is to explore machine learning in the healthcare space by building a classification model that predicts a patient's health risk score. The model can help identify high-risk individuals and support early intervention strategies.
The dataset includes synthetic healthcare data with the following features:
- Age
- Gender
- Ethnicity
- Income Level
- Employment Status
- Medical History
- Health Risk Score (Target)
Additional sample records were created to balance the dataset and improve classification performance.
- Python
- Pandas for data processing
- scikit-learn for model training and evaluation
- RandomForestClassifier for classification
- LabelEncoder for categorical feature encoding
- Google Colab for development and execution
- Data cleaning and encoding of categorical features
- Class filtering to remove underrepresented health risk classes
- Splitting data into training and testing sets (70/30)
- Training a Random Forest model
- Evaluating model performance with:
- Confusion matrix
- Classification report (accuracy, precision, recall)
- Exporting the trained model using
joblib
The initial model achieved ~62% accuracy on the test data. Performance can be further improved by:
- Hyperparameter tuning
- Adding more diverse training data
- Exploring alternative classification algorithms
- Implement cross-validation
- Tune hyperparameters using GridSearchCV
- Add more advanced visualizations (e.g., SHAP for feature importance)
- Improve dataset size and balance for better generalization
- Trained model saved as
random_forest_model.joblib
- Notebook includes full pipeline from preprocessing to evaluation
This notebook was developed in Google Colab and can be adapted for use with any structured healthcare dataset.
Project by Patricia L Johnson