This project is part of the DevelopersHub Internship (AI/ML).
The task for Week 1 was to learn the basic ML workflow by predicting heart disease using patient data from the UCI Cleveland Heart Disease dataset.

    WEEK1-disease-prediction/
    │
    ├── data/
    │   └── cleveland.csv                  # Dataset (renamed from processed.cleveland.data)
    │
    ├── notebooks/                         # Jupyter notebooks for each step
    │   ├── 01_load_and_explore.ipynb      # Step 1: Load & Explore dataset
    │   ├── 02_preprocessing.ipynb         # Step 2: Preprocessing (imputation, scaling, binary target)
    │   ├── 03_eda.ipynb                   # Step 3: Exploratory Data Analysis (EDA)
    │   ├── 04_model_training.ipynb        # Step 4: Model Training (Logistic Regression & Random Forest)
    │   └── 05_evaluation_report.ipynb     # Step 5: One-page summary notebook
    │
    ├── week1_report.md                    # Short 1-page Markdown report
    ├── week1_report.pdf                   # Exported PDF report (for submission)
    └── README.md                          # Project documentation (this file)

---
## Dataset
- **Source:** UCI Machine Learning Repository (Cleveland subset, processed version)
- **Size:** 303 rows × 14 columns
- **Target:** `target` (0–4) → binarized to `target_bin` (0 = healthy, 1 = disease)
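
The loading and binarization step can be sketched as follows. This is a minimal illustration, not the notebook's actual code: it assumes the CSV has no header row and marks missing values with `?` (as in the original `processed.cleveland.data`), the column names in `COLUMNS` are assumed, and `load_and_binarize` is a hypothetical helper. The three sample rows are illustrative stand-ins in the file's format, so the snippet runs without the dataset.

```python
import io

import pandas as pd

# Assumed column names; the raw UCI file ships without a header row.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# Illustrative rows only (not guaranteed to be real records).
SAMPLE = """\
63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0
67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2
52.0,1.0,3.0,172.0,199.0,1.0,0.0,162.0,0.0,0.5,1.0,?,7.0,0
"""


def load_and_binarize(source):
    """Read the data and add `target_bin` (0 = healthy, 1 = disease)."""
    # "?" is parsed as NaN so it can be imputed later.
    df = pd.read_csv(source, header=None, names=COLUMNS, na_values="?")
    # Any original target value above 0 counts as disease.
    df["target_bin"] = (df["target"] > 0).astype(int)
    return df


# With the real file this would be load_and_binarize("data/cleveland.csv").
df = load_and_binarize(io.StringIO(SAMPLE))
print(df[["ca", "target", "target_bin"]])
```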
---
## Preprocessing
- Missing values:
  - `ca`, `thal` → filled with the **mode** (most frequent value)
  - other numeric columns → filled with the **median**
- Features scaled to **[0, 1]** using `MinMaxScaler`
- Final dataset: **13 features + 1 binary target**
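
The imputation and scaling rules above could look roughly like this. It is a sketch under assumptions: `preprocess` is a hypothetical helper, the toy frame stands in for the real data, and the notebook may implement the same rules differently.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df, target_col="target_bin", mode_cols=("ca", "thal")):
    """Impute missing values, then scale every feature to [0, 1]."""
    X = df.drop(columns=[target_col]).copy()
    for col in X.columns:
        if col in mode_cols:
            fill = X[col].mode().iloc[0]  # most frequent value
        else:
            fill = X[col].median()
        X[col] = X[col].fillna(fill)
    scaled = MinMaxScaler().fit_transform(X)
    X_scaled = pd.DataFrame(scaled, columns=X.columns, index=X.index)
    return X_scaled, df[target_col]


# Toy data with one missing value per feature column (illustrative only).
toy = pd.DataFrame({
    "age": [40.0, 50.0, np.nan, 60.0],
    "ca": [0.0, 0.0, np.nan, 2.0],
    "thal": [3.0, 7.0, 7.0, np.nan],
    "target_bin": [0, 1, 1, 0],
})
X_scaled, y = preprocess(toy)
print(X_scaled)
```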
---
## Exploratory Data Analysis (EDA)
- Class balance: ~54% healthy, ~46% disease
- Feature distributions plotted (histograms)
- Correlation heatmap to study feature relationships
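
The numbers behind these checks can be computed without plotting, as in the sketch below. The helper names are hypothetical, the class counts are chosen only to reproduce the ~54% / ~46% split reported above, and the small feature frame is made-up data. In a notebook, `DataFrame.hist()` and `seaborn.heatmap(df.corr())` are the usual plotting counterparts.

```python
import pandas as pd


def class_balance(y):
    """Fraction of samples per class, ordered by class label."""
    return y.value_counts(normalize=True).sort_index()


def correlations_with_target(df, target_col="target_bin"):
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.corr()[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Assumed counts that match the reported class balance on 303 rows.
y = pd.Series([0] * 164 + [1] * 139, name="target_bin")
balance = class_balance(y)
print(balance.round(2))

# Made-up feature values, used only to exercise the helper.
toy = pd.DataFrame({
    "thalach": [150, 108, 162, 129, 170, 120],
    "chol": [233, 286, 199, 229, 250, 204],
    "target_bin": [0, 1, 0, 1, 0, 1],
})
ranked = correlations_with_target(toy)
print(ranked)
```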
---
## Models & Results
Two models were trained and evaluated:
| Model | Accuracy |
|-----------------------|----------|
| Logistic Regression | **0.8525** |
| Random Forest         | **0.9016** |
**Selected Model:** Random Forest (higher accuracy)

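
A comparison like the one in the table can be sketched as below. Everything specific here is an assumption: the data is a synthetic stand-in from `make_classification` (so the printed accuracies will not match the table), and the 80/20 stratified split, `random_state` values, and hyperparameters are illustrative choices. The reported accuracies are consistent with a 61-row test set (52/61 ≈ 0.8525, 55/61 ≈ 0.9016), which is what a 20% hold-out of 303 rows yields, but the README does not state the split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the real data: 303 rows, 13 features.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

# Assumed 80/20 stratified split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Fit each model and record its accuracy on the held-out rows.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

best = max(results, key=results.get)
for name, acc in results.items():
    print(f"{name}: {acc:.4f}")
print("Selected:", best)
```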
---
## Outcome
- Learned a complete **ML workflow**:
  - Data loading → preprocessing → EDA → model training → evaluation
- Produced a **1-page report** (`week1_report.pdf`) for submission
- Random Forest performed best and is the selected baseline model.
---