Skip to content

This project is a healthcare AI model built using Python and scikit-learn to predict patient health risk levels (Low, Moderate, High) based on demographic, socioeconomic, and medical history data.

License

Notifications You must be signed in to change notification settings

PatriciaLJohnson/Healthcare-Risk-Prediction-with-Random-Forest

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

3 Commits
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Healthcare Risk Prediction with Random Forest

This project is a healthcare AI model built using Python and scikit-learn to predict patient health risk levels (Low, Moderate, High) based on demographic, socioeconomic, and medical history data.

πŸ’‘ Project Goal

The goal of this project is to explore machine learning in the healthcare space by building a classification model that predicts a patient's health risk score. The model can help identify high-risk individuals and support early intervention strategies.

πŸ“Š Dataset

The dataset includes synthetic healthcare data with the following features:

  • Age
  • Gender
  • Ethnicity
  • Income Level
  • Employment Status
  • Medical History
  • Health Risk Score (Target)

Additional sample records were created to balance the dataset and improve classification performance.

πŸ›  Tools & Technologies

  • Python
  • Pandas for data processing
  • scikit-learn for model training and evaluation
  • RandomForestClassifier for classification
  • LabelEncoder for categorical feature encoding
  • Google Colab for development and execution

πŸ“ˆ Key Steps

  • Data cleaning and encoding of categorical features
  • Class filtering to remove underrepresented health risk classes
  • Splitting data into training and testing sets (70/30)
  • Training a Random Forest model
  • Evaluating model performance with:
    • Confusion matrix
    • Classification report (accuracy, precision, recall)
  • Exporting the trained model using joblib

βš™οΈ Model Performance

The initial model achieved ~62% accuracy on the test data. Performance can be further improved by:

  • Hyperparameter tuning
  • Adding more diverse training data
  • Exploring alternative classification algorithms

πŸš€ Future Improvements

  • Implement cross-validation
  • Tune hyperparameters using GridSearchCV
  • Add more advanced visualizations (e.g., SHAP for feature importance)
  • Improve dataset size and balance for better generalization

πŸ“‚ Output

  • Trained model saved as random_forest_model.joblib
  • Notebook includes full pipeline from preprocessing to evaluation

πŸ”— Try It Out

This notebook was developed in Google Colab and can be adapted for use with any structured healthcare dataset.


Project by Patricia L Johnson

About

This project is a healthcare AI model built using Python and scikit-learn to predict patient health risk levels (Low, Moderate, High) based on demographic, socioeconomic, and medical history data.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published