A scalable machine learning pipeline using Apache Spark to classify the risk of heart disease based on medical attributes. This project leverages distributed computing for handling large datasets efficiently and includes robust preprocessing, feature engineering, model training, and evaluation.
This project implements a Random Forest Classifier to predict the risk level of heart disease using the UCI Heart Disease dataset. It uses PySpark MLlib to handle large-scale data processing and build a production-ready pipeline.
- Apache Spark ML pipeline for end-to-end processing
- Handling missing values with imputation
- Encoding categorical features
- Feature scaling and vector assembly
- Weighted classification to address class imbalance (see the weighting sketch after this list)
- Model training using Random Forest Classifier
- Model evaluation with confusion matrix and metrics
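One way the class weighting can be implemented is to give every row a weight inversely proportional to its class frequency and let the classifier consume it through `weightCol`. A minimal sketch, assuming the label column is named `num` (the notebook may use a different name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkML_Heart_Risk_Classifier").getOrCreate()
df = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

# Inverse-frequency weights: rare classes receive proportionally larger weights.
total = df.count()
counts = df.groupBy("num").count().collect()
weights = spark.createDataFrame(
    [(row["num"], total / (len(counts) * row["count"])) for row in counts],
    ["num", "weight"],
)

# Attach the per-row weight column; the classifier picks it up via weightCol="weight".
df = df.join(weights, on="num", how="left")
df.groupBy("num", "weight").count().show()
```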
- Random Forest Classifier
- StringIndexer
- Imputer (mean strategy)
- StandardScaler
- MulticlassClassificationEvaluator
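For orientation, this is roughly how those components are instantiated with PySpark MLlib. The column names (`chol`, `cp`, label `num`, weight column `weight`) are illustrative assumptions taken from the UCI schema; the notebook may use different ones:

```python
from pyspark.ml.feature import Imputer, StringIndexer, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Mean imputation for a numeric column with missing values.
imputer = Imputer(strategy="mean", inputCols=["chol"], outputCols=["chol_imp"])

# Map a string category (e.g. chest pain type) to a numeric index.
indexer = StringIndexer(inputCol="cp", outputCol="cp_idx", handleInvalid="keep")

# Standardize the assembled feature vector.
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

# Random Forest that consumes per-row class weights to counter class imbalance.
rf = RandomForestClassifier(labelCol="num", featuresCol="features",
                            weightCol="weight", numTrees=100, seed=42)

# Multiclass metrics: accuracy, weighted precision/recall, F1.
evaluator = MulticlassClassificationEvaluator(labelCol="num", predictionCol="prediction",
                                              metricName="accuracy")
```

These stages are chained together with a `VectorAssembler` inside a single `Pipeline`, as shown in the end-to-end sketch further below.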
- Python 🐍
- Apache Spark ⚡
- PySpark MLlib
- Pandas, Seaborn, Matplotlib (for visualization)
- Source: UCI Machine Learning Repository - Heart Disease Dataset
- Format: CSV
- Make sure the file `heart_disease_uci.csv` is present in the same directory as the notebook.
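For reference, a minimal sketch of how the file can be loaded (the application name is arbitrary):

```python
from pyspark.sql import SparkSession

# Start a Spark session and read the CSV with a header row and inferred column types.
spark = SparkSession.builder.appName("SparkML_Heart_Risk_Classifier").getOrCreate()
df = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

df.printSchema()
print("Rows:", df.count())
```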
- Start Spark Session
- Load Dataset
- Drop Unnecessary Columns
- Fix Data Types
- Handle Class Imbalance (Weighting)
- Impute Missing Values
- Encode Categorical Variables
- Assemble Features into Vector
- Train/Test Split
- Train the Random Forest Classifier
- Evaluate the Model
- Accuracy
- Precision / Recall
- F1-score
- Confusion Matrix (Plotted using Matplotlib)
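The steps above can be condensed into the following end-to-end sketch. Column names, the dropped columns, the label `num`, and hyperparameters such as `numTrees=100` are illustrative assumptions, not necessarily what the notebook uses:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("SparkML_Heart_Risk_Classifier").getOrCreate()
df = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

# Illustrative column selection based on the UCI schema.
df = df.drop("id", "dataset")  # drop identifier-style columns if present
numeric_cols = ["age", "trestbps", "chol", "thalch", "oldpeak"]
categorical_cols = ["sex", "cp", "restecg", "slope", "thal"]
label_col = "num"

# Fix data types: cast numeric columns to double so the Imputer accepts them.
for c in numeric_cols:
    df = df.withColumn(c, F.col(c).cast("double"))

# Inverse-frequency class weights, attached as a `weight` column (see the weighting sketch above).
total, counts = df.count(), df.groupBy(label_col).count().collect()
weights = spark.createDataFrame(
    [(r[label_col], total / (len(counts) * r["count"])) for r in counts],
    [label_col, "weight"])
df = df.join(weights, on=label_col)

# Preprocessing and the Random Forest assembled into one Spark ML Pipeline.
stages = [Imputer(strategy="mean", inputCols=numeric_cols,
                  outputCols=[f"{c}_imp" for c in numeric_cols])]
stages += [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
           for c in categorical_cols]
stages += [
    VectorAssembler(inputCols=[f"{c}_imp" for c in numeric_cols]
                              + [f"{c}_idx" for c in categorical_cols],
                    outputCol="features_raw"),
    StandardScaler(inputCol="features_raw", outputCol="features"),
    RandomForestClassifier(labelCol=label_col, featuresCol="features",
                           weightCol="weight", numTrees=100, seed=42),
]

# Train/test split, training, and evaluation.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train_df)
predictions = model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(labelCol=label_col, predictionCol="prediction")
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    print(metric, "=", evaluator.evaluate(predictions, {evaluator.metricName: metric}))
```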
- Install dependencies: `pip install pyspark pandas seaborn matplotlib`
- Run the notebook: `jupyter notebook SparkML_Heart_Risk_Classifier.ipynb`
- Make sure `heart_disease_uci.csv` is in the same directory as the notebook.
- Confusion matrix visualization (see the plotting sketch after this list)
- Class weights applied
- Schema changes after preprocessing
- Final model performance scores
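The confusion matrix can be cross-tabulated in Spark and plotted on the driver with Seaborn/Matplotlib. A minimal sketch, assuming `predictions` is the test-set output of the fitted pipeline from the sketch above (label column `num`, prediction column `prediction`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Cross-tabulate true labels vs. predicted labels; the result is tiny, so it is safe to collect.
cm = (predictions
      .groupBy("num")
      .pivot("prediction")
      .count()
      .fillna(0)
      .orderBy("num")
      .toPandas()
      .set_index("num"))

sns.heatmap(cm, annot=True, fmt="g", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix")
plt.show()
```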
- Add support for multiple classifiers (e.g., Logistic Regression, Gradient-Boosted Trees)
- Automate hyperparameter tuning
- Deploy model using Flask or Streamlit
- Save model using Spark ML persistence
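For the persistence item, a fitted `PipelineModel` can be written to disk and reloaded later without retraining. A minimal sketch, assuming `model` is the fitted pipeline from the training sketch above and the output path is arbitrary:

```python
from pyspark.ml import PipelineModel

# Persist the fitted pipeline (preprocessing stages + Random Forest) to a directory.
model.write().overwrite().save("models/heart_risk_rf")

# Later, or from another application, reload and score new data with the same preprocessing.
reloaded = PipelineModel.load("models/heart_risk_rf")
# new_predictions = reloaded.transform(new_df)
```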
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License.
Created by Sayyed Hossein Hosseini DolatAbadi