A scalable machine learning pipeline using Apache Spark to classify the risk of heart disease based on medical attributes. This project leverages distributed computing for handling large datasets efficiently and includes robust preprocessing, feature engineering, model training, and evaluation.
This project implements a Random Forest Classifier to predict the risk level of heart disease using the UCI Heart Disease dataset. It uses PySpark MLlib to handle large-scale data processing and build a production-ready pipeline.
- Apache Spark ML pipeline for end-to-end processing
- Handling missing values with imputation
- Encoding categorical features
- Feature scaling and vector assembly
- Weighted classification to address class imbalance (see the weighting sketch after this list)
- Model training using Random Forest Classifier
- Model evaluation with confusion matrix and metrics
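One way the class weighting can be implemented is to give every row a weight inversely proportional to its class frequency and let the classifier consume it through `weightCol`. A minimal sketch, assuming the label column is named `num` (the notebook may use a different name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkML_Heart_Risk_Classifier").getOrCreate()
df = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

# Inverse-frequency weights: rare classes receive proportionally larger weights.
total = df.count()
counts = df.groupBy("num").count().collect()
weights = spark.createDataFrame(
    [(row["num"], total / (len(counts) * row["count"])) for row in counts],
    ["num", "weight"],
)

# Attach the per-row weight column; the classifier picks it up via weightCol="weight".
df = df.join(weights, on="num", how="left")
df.groupBy("num", "weight").count().show()
```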
- Random Forest Classifier
- StringIndexer
- Imputer (mean strategy)
- StandardScaler
- MulticlassClassificationEvaluator
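For orientation, this is roughly how those components are instantiated with PySpark MLlib. The column names (`chol`, `cp`, label `num`, weight column `weight`) are illustrative assumptions taken from the UCI schema; the notebook may use different ones:

```python
from pyspark.ml.feature import Imputer, StringIndexer, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Mean imputation for a numeric column with missing values.
imputer = Imputer(strategy="mean", inputCols=["chol"], outputCols=["chol_imp"])

# Map a string category (e.g. chest pain type) to a numeric index.
indexer = StringIndexer(inputCol="cp", outputCol="cp_idx", handleInvalid="keep")

# Standardize the assembled feature vector.
scaler = StandardScaler(inputCol="features_raw", outputCol="features")

# Random Forest that consumes per-row class weights to counter class imbalance.
rf = RandomForestClassifier(labelCol="num", featuresCol="features",
                            weightCol="weight", numTrees=100, seed=42)

# Multiclass metrics: accuracy, weighted precision/recall, F1.
evaluator = MulticlassClassificationEvaluator(labelCol="num", predictionCol="prediction",
                                              metricName="accuracy")
```

These stages are chained together with a `VectorAssembler` inside a single `Pipeline`, as shown in the end-to-end sketch further below.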
- Python 🐍
- Apache Spark ⚡
- PySpark MLlib
- Pandas, Seaborn, Matplotlib (for visualization)
- Source: UCI Machine Learning Repository - Heart Disease Dataset
- Format: CSV
- Make sure the file `heart_disease_uci.csv` is present in the same directory as the notebook.
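For reference, a minimal sketch of how the file can be loaded (the application name is arbitrary):

```python
from pyspark.sql import SparkSession

# Start a Spark session and read the CSV with a header row and inferred column types.
spark = SparkSession.builder.appName("SparkML_Heart_Risk_Classifier").getOrCreate()
df = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

df.printSchema()
print("Rows:", df.count())
```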
- Start Spark Session
- Load Dataset
- Drop Unnecessary Columns
- Fix Data Types
- Handle Class Imbalance (Weighting)
- Impute Missing Values
- Encode Categorical Variables
- Assemble Features into Vector
- Train/Test Split
- Train the Random Forest Classifier
- Evaluate the Model
- Accuracy
- Precision / Recall
- F1-score
- Confusion Matrix (Plotted using Matplotlib)
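The steps above can be condensed into the following end-to-end sketch. Column names, the dropped columns, the label `num`, and hyperparameters such as `numTrees=100` are illustrative assumptions, not necessarily what the notebook uses:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml import Pipeline
from pyspark.ml.feature import Imputer, StringIndexer, VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("SparkML_Heart_Risk_Classifier").getOrCreate()
df = spark.read.csv("heart_disease_uci.csv", header=True, inferSchema=True)

# Illustrative column selection based on the UCI schema.
df = df.drop("id", "dataset")  # drop identifier-style columns if present
numeric_cols = ["age", "trestbps", "chol", "thalch", "oldpeak"]
categorical_cols = ["sex", "cp", "restecg", "slope", "thal"]
label_col = "num"

# Fix data types: cast numeric columns to double so the Imputer accepts them.
for c in numeric_cols:
    df = df.withColumn(c, F.col(c).cast("double"))

# Inverse-frequency class weights, attached as a `weight` column (see the weighting sketch above).
total, counts = df.count(), df.groupBy(label_col).count().collect()
weights = spark.createDataFrame(
    [(r[label_col], total / (len(counts) * r["count"])) for r in counts],
    [label_col, "weight"])
df = df.join(weights, on=label_col)

# Preprocessing and the Random Forest assembled into one Spark ML Pipeline.
stages = [Imputer(strategy="mean", inputCols=numeric_cols,
                  outputCols=[f"{c}_imp" for c in numeric_cols])]
stages += [StringIndexer(inputCol=c, outputCol=f"{c}_idx", handleInvalid="keep")
           for c in categorical_cols]
stages += [
    VectorAssembler(inputCols=[f"{c}_imp" for c in numeric_cols]
                              + [f"{c}_idx" for c in categorical_cols],
                    outputCol="features_raw"),
    StandardScaler(inputCol="features_raw", outputCol="features"),
    RandomForestClassifier(labelCol=label_col, featuresCol="features",
                           weightCol="weight", numTrees=100, seed=42),
]

# Train/test split, training, and evaluation.
train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=stages).fit(train_df)
predictions = model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(labelCol=label_col, predictionCol="prediction")
for metric in ["accuracy", "weightedPrecision", "weightedRecall", "f1"]:
    print(metric, "=", evaluator.evaluate(predictions, {evaluator.metricName: metric}))
```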
- Install dependencies: `pip install pyspark pandas seaborn matplotlib`
- Run the notebook: `jupyter notebook SparkML_Heart_Risk_Classifier.ipynb`
- Make sure `heart_disease_uci.csv` is in the same directory as the notebook.
- Confusion matrix visualization (see the plotting sketch after this list)
- Class weights applied
- Schema changes after preprocessing
- Final model performance scores
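The confusion matrix can be cross-tabulated in Spark and plotted on the driver with Seaborn/Matplotlib. A minimal sketch, assuming `predictions` is the test-set output of the fitted pipeline from the sketch above (label column `num`, prediction column `prediction`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Cross-tabulate true labels vs. predicted labels; the result is tiny, so it is safe to collect.
cm = (predictions
      .groupBy("num")
      .pivot("prediction")
      .count()
      .fillna(0)
      .orderBy("num")
      .toPandas()
      .set_index("num"))

sns.heatmap(cm, annot=True, fmt="g", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.title("Confusion Matrix")
plt.show()
```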
- Add support for multiple classifiers (e.g., Logistic Regression, Gradient-Boosted Trees)
- Automate hyperparameter tuning
- Deploy model using Flask or Streamlit
- Save model using Spark ML persistence
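For the persistence item, a fitted `PipelineModel` can be written to disk and reloaded later without retraining. A minimal sketch, assuming `model` is the fitted pipeline from the training sketch above and the output path is arbitrary:

```python
from pyspark.ml import PipelineModel

# Persist the fitted pipeline (preprocessing stages + Random Forest) to a directory.
model.write().overwrite().save("models/heart_risk_rf")

# Later, or from another application, reload and score new data with the same preprocessing.
reloaded = PipelineModel.load("models/heart_risk_rf")
# new_predictions = reloaded.transform(new_df)
```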
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License.
Created by Sayyed Hossein Hosseini DolatAbadi