
Week 1 of my AI/ML Internship at DevelopersHub 🚀. Built a disease prediction model using patient data. Explored the UCI Cleveland dataset, handled missing values, ran EDA, and compared Logistic Regression vs Random Forest. Random Forest achieved 90.16% accuracy ✅


A-iftikhar02/-Disease-Prediction-Using-Patient-Data


🩺 Week 1 β€” Disease Prediction Using Patient Data

This project is part of the DevelopersHub Internship (AI/ML).
The task for Week 1 was to learn the basic ML workflow by predicting heart disease using patient data from the UCI Cleveland Heart Disease dataset.


## 📂 Project Structure

WEEK1-disease-prediction/
│
├── data/
│   └── cleveland.csv               # Dataset (renamed from processed.cleveland.data)
│
├── notebooks/                      # Jupyter notebooks for each step
│   ├── 01_load_and_explore.ipynb   # Step 1: Load & Explore dataset
│   ├── 02_preprocessing.ipynb      # Step 2: Preprocessing (imputation, scaling, binary target)
│   ├── 03_eda.ipynb                # Step 3: Exploratory Data Analysis (EDA)
│   ├── 04_model_training.ipynb     # Step 4: Model Training (Logistic Regression & Random Forest)
│   └── 05_evaluation_report.ipynb  # Step 5: One-page summary notebook
│
├── week1_report.md                 # Short 1-page Markdown report
├── week1_report.pdf                # Exported PDF report (for submission)
└── README.md                       # Project documentation (this file)



---

## 📊 Dataset
- **Source:** UCI Machine Learning Repository (Cleveland subset, processed version)  
- **Size:** 303 rows × 14 columns  
- **Target:** `target` (0–4) → binarized to `target_bin` (0 = healthy, 1 = disease)  
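
The loading and binarization step can be sketched roughly as follows. This is a minimal sketch, not the notebook code: the column names, the headerless layout, and the `?` missing-value marker are assumptions based on the raw UCI `processed.cleveland.data` format, and `load_cleveland` and `add_binary_target` are hypothetical helper names. The sample rows are illustrative only.

```python
import io

import pandas as pd

# Assumed column names: the first 13 follow the UCI attribute list; the last
# one is called `target` here to match this README (UCI names it `num`).
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]


def load_cleveland(source) -> pd.DataFrame:
    """Read a headerless Cleveland file, treating '?' as a missing value."""
    return pd.read_csv(source, header=None, names=COLUMNS, na_values="?")


def add_binary_target(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `target_bin`: 0 when target == 0, otherwise 1."""
    out = df.copy()
    out["target_bin"] = (out["target"] > 0).astype(int)
    return out


# Illustrative rows in the raw file layout (not claimed to be real records).
sample = io.StringIO(
    "63.0,1.0,1.0,145.0,233.0,1.0,2.0,150.0,0.0,2.3,3.0,0.0,6.0,0\n"
    "67.0,1.0,4.0,160.0,286.0,0.0,2.0,108.0,1.0,1.5,2.0,3.0,3.0,2\n"
    "50.0,0.0,3.0,120.0,219.0,0.0,0.0,158.0,0.0,1.6,2.0,?,3.0,0\n"
)
df = add_binary_target(load_cleveland(sample))
```

If `data/cleveland.csv` already has a header row, the `header=None, names=COLUMNS` arguments would need to be dropped.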

---

## ⚙️ Preprocessing
- Missing values:
  - `ca`, `thal` → filled with **mode** (most frequent value)
  - other numeric columns → filled with **median**
- Features scaled to **[0, 1]** using `MinMaxScaler`
- Final dataset: **13 features + 1 binary target**
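
The imputation and scaling rules above could look something like this in pandas and scikit-learn. It is a sketch under assumptions: `preprocess` is a hypothetical helper, the demo frame is made up, and the source does not say whether the scaler was fit before or after the train/test split.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def preprocess(df, mode_cols=("ca", "thal"), target_col="target_bin"):
    """Impute missing values, then scale every feature to [0, 1]."""
    features = df.drop(columns=[target_col]).copy()
    for col in features.columns:
        if col in mode_cols:
            # Most frequent value; mode() ignores NaN by default.
            fill = features[col].mode().iloc[0]
        else:
            fill = features[col].median()
        features[col] = features[col].fillna(fill)

    scaled = MinMaxScaler().fit_transform(features)
    out = pd.DataFrame(scaled, columns=features.columns, index=features.index)
    out[target_col] = df[target_col].to_numpy()  # target is left unscaled
    return out


# Made-up frame with one gap per feature column.
demo = pd.DataFrame({
    "age": [40.0, 50.0, np.nan, 60.0],
    "ca": [0.0, 0.0, np.nan, 2.0],
    "thal": [3.0, 7.0, 7.0, np.nan],
    "target_bin": [0, 1, 1, 0],
})
clean = preprocess(demo)
```

Fitting the scaler on the full dataset is the simplest option, but it lets test-set minima and maxima influence training. A stricter variant would fit `MinMaxScaler` on the training split only.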

---

## 🔍 Exploratory Data Analysis (EDA)
- Class balance: ~54% healthy, ~46% disease  
- Feature distributions plotted (histograms)  
- Correlation heatmap to study feature relationships  
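
The numbers behind these checks can be computed without any plotting, as in the sketch below. The helper names and the toy frame are hypothetical; the notebook presumably draws histograms and a heatmap from the same quantities.

```python
import pandas as pd


def class_balance(df, target_col="target_bin"):
    """Share of rows per class, indexed by class label."""
    return df[target_col].value_counts(normalize=True).sort_index()


def correlations_with_target(df, target_col="target_bin"):
    """Pearson correlation of each feature with the target, strongest first."""
    corr = df.corr()[target_col].drop(target_col)
    return corr.sort_values(key=abs, ascending=False)


# Toy frame: `x` rises with the target, `y` falls with it.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [5, 4, 3, 2, 1],
    "target_bin": [0, 0, 0, 1, 1],
})
balance = class_balance(toy)
target_corr = correlations_with_target(toy)
```

The full matrix from `df.corr()` is what a correlation heatmap would visualize.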

---

## 🤖 Models & Results
Two models were trained and evaluated:

| Model                | Accuracy |
|-----------------------|----------|
| Logistic Regression   | **0.8525** |
| Random Forest         | **0.9016** ✅ |

**Selected Model:** Random Forest (better accuracy)
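
A comparison like the one in the table might be set up as below. This is a sketch only: the split ratio, random seed, and hyperparameters are assumptions (the source does not state them), `compare_models` is a hypothetical helper, and synthetic data stands in for the preprocessed Cleveland features, so the scores will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, test_size=0.2, seed=42):
    """Train both models on one split and return test accuracy per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )
    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test))
    return scores


# Synthetic stand-in with the same shape as the real data (303 rows, 13 features).
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=0)
scores = compare_models(X_demo, y_demo)
```

Both reported accuracies are consistent with a 61-row test set (52/61 ≈ 0.8525 and 55/61 ≈ 0.9016), which is what a 20% hold-out of 303 rows would give. The source does not state the split, so this is an inference. If it holds, the gap between the two models is three test samples.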

---

## 📄 Outcome
- Learned a complete **ML workflow**:
  - Data loading → preprocessing → EDA → model training → evaluation  
- Produced a **1-page report** (`week1_report.pdf`) for submission  
- Random Forest performed best and is the selected baseline model.  

---
