This repository contains code and models for predicting customer churn in the telecommunications industry. 📊 The project uses machine learning and statistical techniques to identify customers at risk of leaving.
The eda.ipynb notebook performs an in-depth Exploratory Data Analysis (EDA) on a telecom churn dataset to uncover:
- 📆 Service Agreements: Month-to-month customers churn more than those on annual/bi-annual plans
- ⏳ Tenure: Newer customers show higher churn risk
- 💸 Charges: High monthly charges and sudden increases are churn signals
- 📡 Services: Fewer service add-ons (e.g., no internet/security) = higher churn
- 💳 Payment Methods: Customers using electronic checks churn more frequently
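
As a rough illustration of how a finding like the contract-type effect can be checked, the sketch below computes the churn rate per Contract value. The rows are made-up toy data (not taken from the dataset) and `churn_rate_by` is a hypothetical helper; only the column names Contract and Churn come from the dataset description. It uses the standard library so it runs without extra dependencies.

```python
from collections import defaultdict

# Toy rows for illustration only (NOT real data); keys mirror dataset columns.
rows = [
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "Yes"},
    {"Contract": "Month-to-month", "Churn": "No"},
    {"Contract": "One year", "Churn": "No"},
    {"Contract": "One year", "Churn": "Yes"},
    {"Contract": "Two year", "Churn": "No"},
    {"Contract": "Two year", "Churn": "No"},
]

def churn_rate_by(rows, column):
    """Return {category: share of rows with Churn == 'Yes'} for one column."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for row in rows:
        totals[row[column]] += 1
        if row["Churn"] == "Yes":
            churned[row[column]] += 1
    return {key: churned[key] / totals[key] for key in totals}

rates = churn_rate_by(rows, "Contract")
# Print categories from highest to lowest churn rate.
for contract, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{contract:15s} {rate:.0%}")
```

The same grouping can be repeated for PaymentMethod or any other categorical column by changing the `column` argument.
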
- 🔹 Go to the Demo Data tab
- 🔹 Click Run Prediction
- 🔹 (Optional) Click Evaluate Predictions
- 🟢 Go to the Manual Input tab
- 🟢 Fill in customer details
- 🟢 Click Predict Churn
- 📤 Go to the Upload Data tab
- 📤 Upload your CSV file
- 📤 (Optional) Upload true labels for evaluation
| ⚙️ Model | 🎯 CV Mean AUC (± std) | 🧪 Test AUC |
|---|---|---|
| 🐱 CatBoost | 0.8511 ± 0.0146 | 0.8472 |
| 🐱 CatBoost (tuned) | 0.8385 (best trial) | n/a |
| ⚔️ XGBoost | 0.8503 ± 0.0147 | 0.8481 |
| ⚔️ XGBoost (tuned) | 0.8407 (best trial) | n/a |
| 💡 LightGBM | 0.8482 ± 0.0154 | 0.8506 |
| 💡 LightGBM (tuned) | 0.8383 (best trial) | n/a |
| 🧠 Stacking Ensemble | n/a | 0.8491 |
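
For readers unfamiliar with the metric in this table, here is a minimal, dependency-free sketch of what an AUC value and a "mean ± std" cell represent. `roc_auc` is a hypothetical helper written for illustration, not the project's evaluation code; the fold scores are placeholders, and using the population standard deviation is an assumption, since the source does not say which convention was used.

```python
from statistics import mean, pstdev

def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count as half a win. O(n_pos * n_neg), fine for small illustrations.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-fold AUCs (placeholders, not the project's actual folds).
fold_aucs = [0.84, 0.86, 0.85, 0.83, 0.87]
# Assumption: population std; the sample std (statistics.stdev) is also common.
summary = f"{mean(fold_aucs):.4f} ± {pstdev(fold_aucs):.4f}"
print(summary)

# Tiny worked example: 3 of the 4 positive/negative pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)
```

In practice a library routine such as scikit-learn's `roc_auc_score` would normally be used; the pairwise version above is only meant to make the definition concrete.
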
- ✅ Accuracy: 81.23%
- 🔢 Confusion Matrix:
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 4,697 | 477 |
| Actual Yes | 845 | 1,024 |
- ✅ Accuracy: 81.58%
- 🔢 Confusion Matrix:
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 4,733 | 441 |
| Actual Yes | 856 | 1,013 |
- ✅ Accuracy: 81.56%
- 🔢 Confusion Matrix:
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 4,725 | 449 |
| Actual Yes | 850 | 1,019 |
- ✅ Accuracy: 81.16%
- 🔢 Confusion Matrix:
| | Predicted No | Predicted Yes |
|---|---|---|
| Actual No | 4,776 | 398 |
| Actual Yes | 929 | 940 |
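
Accuracy alone hides how each model treats the churn class, so the sketch below derives precision, recall, and F1 from the four confusion matrices above, taken in the order they are listed. The counts are copied from the tables; `summarize` is a hypothetical helper, and treating Churn = Yes as the positive class is an assumption.

```python
# (TN, FP, FN, TP) copied from the four confusion matrices above, in order.
matrices = [
    (4697, 477, 845, 1024),
    (4733, 441, 856, 1013),
    (4725, 449, 850, 1019),
    (4776, 398, 929, 940),
]

def summarize(tn, fp, fn, tp):
    """Derive standard metrics for the positive (Churn = Yes) class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

results = [summarize(*m) for m in matrices]
for i, r in enumerate(results, start=1):
    print(f"matrix {i}: " + ", ".join(f"{k}={v:.4f}" for k, v in r.items()))
```

The recomputed accuracies match the reported figures, and each matrix sums to 7,043 customers. Recall on the churn class falls between roughly 0.50 and 0.55 in all four cases, which means close to half of the actual churners are missed at the decision threshold used.
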
This dataset includes 7,043 telecom customers and 21 columns covering demographics, service usage, account information, and churn status.
| 🏷️ Column | 📝 Description |
|---|---|
| customerID | Unique customer ID |
| gender | Gender: Male or Female |
| SeniorCitizen | 1 = Senior citizen, 0 = Not |
| Partner | Whether the customer has a partner |
| Dependents | Whether the customer has dependents |
| tenure | Months with the company |
| PhoneService | Phone service subscription |
| MultipleLines | Has multiple phone lines |
| InternetService | DSL, Fiber optic, or None |
| OnlineSecurity | Has online security add-on |
| OnlineBackup | Has online backup add-on |
| DeviceProtection | Has device protection add-on |
| TechSupport | Has technical support add-on |
| StreamingTV | Has streaming TV |
| StreamingMovies | Has streaming movies |
| Contract | Contract type: Month-to-month, One year, Two year |
| PaperlessBilling | Uses paperless billing |
| PaymentMethod | Payment type (e.g., Electronic check) |
| MonthlyCharges | Monthly bill amount |
| TotalCharges | Total charged amount (as string; needs conversion) |
| Churn | Whether customer churned: Yes or No |
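
Because TotalCharges is stored as a string, it has to be converted before any numeric analysis. The sketch below shows one possible conversion; `parse_total_charges` is a hypothetical helper, the sample values are made up, and mapping blank or unparseable entries to `None` is an assumption about how missing values should be represented.

```python
def parse_total_charges(raw):
    """Convert a TotalCharges string to float; return None if blank or invalid."""
    text = raw.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None

# Made-up sample values, including a blank entry.
samples = ["29.85", "1889.5", " ", "108.15"]
parsed = [parse_total_charges(s) for s in samples]
print(parsed)
```

With pandas, the usual equivalent is `pandas.to_numeric(df["TotalCharges"], errors="coerce")`, where `df` stands for the loaded DataFrame; unparseable entries then become NaN and can be handled in the missing-value step.
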
- 📦 Categorical feature review
- ❌ Duplicate detection
- 🔍 Unique value profiling
- 🧼 Handle missing values
- 🧱 Feature binning
- 🧪 Normality tests
  - D’Agostino-Pearson
  - Anderson-Darling
- 📊 Individual visualizations
- 🔗 Correlations:
  - 📉 Numerical vs Numerical (Spearman)
  - 📋 Categorical vs Numerical (Kendall’s Tau, Mann-Whitney U)
  - 🔘 Dichotomous (Phi Coefficient)
  - 🔀 Categorical vs Categorical (Chi-Square, Cramér’s V, Uncertainty Coefficient)
- 🔄 Collinearity Checks
- 🖼️ Feature Pair Visualizations
- 🧠 Multicollinearity
- 📊 Frequency Distribution
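
To make the categorical association measures listed above more concrete, here is a minimal sketch of the Pearson chi-square statistic and Cramér's V for a contingency table. It is not the notebook's implementation: the function names are hypothetical, the counts are made up, and no bias correction is applied.

```python
from math import sqrt

def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows).

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def cramers_v(table):
    """Cramér's V in [0, 1], without bias correction."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi_square(table) / (n * k))

# Made-up counts: rows = contract type, columns = (churn No, churn Yes).
toy = [
    [60, 40],
    [80, 20],
    [95, 5],
]
print(f"chi2 = {chi_square(toy):.2f}, V = {cramers_v(toy):.3f}")
```

For a 2x2 table, Cramér's V equals the absolute value of the phi coefficient, so the same function covers the dichotomous case when only the strength of association matters.
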
- 🔥 One-Hot Encoding
- ✂️ Train-Test Split
- 🧮 Encoding + Scaling
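
The preprocessing steps above can be sketched without any libraries as follows. This is an illustration under stated assumptions, not the project's pipeline: `one_hot` and `stratified_split` are hypothetical helpers, the 20% test fraction and the seed are arbitrary choices, and stratifying on the Churn label is assumed rather than confirmed by the source.

```python
import random

def one_hot(values, categories):
    """Encode each value as a 0/1 list with one slot per known category."""
    return [[int(v == c) for c in categories] for v in values]

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) keeping each label's share similar in both."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Contract levels as listed in the dataset description.
contract_levels = ["Month-to-month", "One year", "Two year"]
encoded = one_hot(["One year", "Month-to-month"], contract_levels)
print(encoded)

# Toy labels (not real data): 10 churners and 30 non-churners.
labels = ["Yes"] * 10 + ["No"] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Whatever tooling is used, fitting encoders and scalers on the training portion only, then applying them to the test portion, keeps test-set information from leaking into training.
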
- 🐱 CatBoost
- ⚔️ XGBoost
- 💡 LightGBM
- 🧠 Stacking Ensemble