This project performs retail customer segmentation and customer lifetime value (CLV) prediction on the Online Retail II dataset to support targeted marketing, retention strategy, and revenue forecasting.
It leverages:
- RFM analysis + KMeans clustering for customer segmentation
- BG/NBD & Gamma-Gamma models for CLV estimation
- XGBoost classifiers and regressors for predictive modeling
Retail businesses often face challenges in understanding which customers are most valuable and how to retain them. This project helps solve that by:
- Identifying customer segments for targeted marketing
- Estimating future value of customers to prioritize high-ROI efforts
- Supporting better decisions for loyalty campaigns, inventory, and engagement
Objectives:
- Segment customers using recency, frequency, monetary (RFM) analysis
- Predict each customer's future CLV to sharpen marketing targeting
- Identify high-value customers for prioritization
- Visualize insights for stakeholders and management teams
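The dataset has 8 columns: two numeric measures (Quantity, Price), one timestamp (InvoiceDate), and five identifier or text fields (Invoice, StockCode, Description, Customer ID, Country).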
- Invoice – Invoice number of the transaction
- StockCode – Unique product code
- Description – Product description
- Quantity – Number of units of the product on the invoice line
- InvoiceDate – Date and time of the purchase
- Price – Unit price of the product, in pounds sterling
- Customer ID – Unique customer identifier
- Country – Country where the customer resides
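RFM and CLV both rely on per-line revenue, which has to be derived from these columns. The sketch below is one plausible preprocessing step, not the project's confirmed pipeline: it assumes rows without a Customer ID are dropped, that cancellations (invoice numbers starting with "C" in this dataset) and non-positive quantities or prices are excluded, and it introduces a hypothetical `Revenue` column. The rows are toy data.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Keep purchase rows that have a customer ID and add a Revenue column."""
    out = df.dropna(subset=["Customer ID"]).copy()
    # Invoice numbers beginning with "C" mark cancellations in this dataset.
    is_cancel = out["Invoice"].astype(str).str.startswith("C")
    out = out[~is_cancel & (out["Quantity"] > 0) & (out["Price"] > 0)].copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["Revenue"] = out["Quantity"] * out["Price"]  # hypothetical column name
    return out

# Toy rows using the dataset's column names (values are illustrative only).
raw = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435", "489436"],
    "StockCode": ["85048", "22087", "22350", "21755"],
    "Description": ["A", "B", "C", "D"],
    "Quantity": [12, -12, 3, 2],
    "InvoiceDate": ["2009-12-01 07:45", "2009-12-01 10:33",
                    "2009-12-01 07:46", "2009-12-01 09:06"],
    "Price": [6.95, 2.95, 2.55, 5.95],
    "Customer ID": [13085.0, 16321.0, None, 13078.0],
    "Country": ["United Kingdom"] * 4,
})
clean = clean_transactions(raw)
```

Here the cancellation row and the row with no Customer ID are removed, leaving two purchase lines.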
RFM segmentation was applied to classify customers based on their transactional patterns. Each customer received a score across Recency, Frequency, and Monetary value. These scores were used to assign them into behavioral segments such as Champions, Loyal Customers, Potential Loyalists, At Risk, and Lost.
This enabled:
- Prioritization of high-value customers for loyalty and upsell strategies
- Identification of at-risk or inactive customers for re-engagement campaigns
- Foundational grouping for downstream CLV modeling and predictive analysis
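The scoring step described above can be sketched with pandas. This is a minimal illustration under assumptions: the input already has a `Revenue` column (hypothetical name), scores come from quantile bins, and `rfm_table`, `snapshot`, and `bins` are names chosen for this example. The rules that map scores to named segments are not stated in this README, so they are left out.

```python
import pandas as pd

def rfm_table(tx: pd.DataFrame, snapshot: pd.Timestamp, bins: int = 5) -> pd.DataFrame:
    """Compute recency, frequency, monetary per customer plus 1..bins scores."""
    rfm = tx.groupby("Customer ID").agg(
        recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        frequency=("Invoice", "nunique"),
        monetary=("Revenue", "sum"),
    )

    def score(values: pd.Series, ascending: bool) -> pd.Series:
        # Rank first so that ties cannot create duplicate bin edges in qcut.
        ranks = values.rank(method="first", ascending=ascending)
        return pd.qcut(ranks, bins, labels=False) + 1

    rfm["R"] = score(rfm["recency"], ascending=False)  # recent buyers score high
    rfm["F"] = score(rfm["frequency"], ascending=True)
    rfm["M"] = score(rfm["monetary"], ascending=True)
    return rfm

# Toy transactions (illustrative values only).
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 3, 3],
    "Invoice": ["A1", "A2", "A3", "B1", "C1", "C2"],
    "InvoiceDate": pd.to_datetime([
        "2010-11-01", "2010-12-20", "2010-12-28",
        "2010-06-01", "2010-10-01", "2010-12-01",
    ]),
    "Revenue": [50.0, 100.0, 150.0, 20.0, 60.0, 40.0],
})
rfm = rfm_table(tx, snapshot=pd.Timestamp("2011-01-01"), bins=3)
```

With three bins, customer 1 (recent, frequent, high spend) gets the top score on every dimension and customer 2 the bottom. A segment label such as Champions or Lost would then be assigned from the score combination.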
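The BG/NBD model is used to predict the expected number of future transactions for each customer.
It models:
- Purchase frequency: while a customer is active, purchases follow a Poisson process whose rate varies across customers according to a Gamma distribution, which yields a Negative Binomial distribution (NBD) of purchase counts
- Dropout (churn): after each purchase a customer may become inactive with some probability, and that probability varies across customers according to a Beta distribution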
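For reference, the standard BG/NBD assumptions can be written out as below. The symbols $r, \alpha, a, b$ follow the usual notation for this model and are not taken from this project's code. The last line shows why the purchase count of a customer who stays active is Negative Binomial.

```latex
\begin{align}
P\bigl(X(t) = x \mid \lambda\bigr) &= \frac{(\lambda t)^{x} e^{-\lambda t}}{x!}
  && \text{purchases in time } t \text{ while active} \\
f(\lambda \mid r, \alpha) &= \frac{\alpha^{r} \lambda^{r-1} e^{-\lambda \alpha}}{\Gamma(r)}
  && \text{purchase rate varies across customers} \\
P(\text{inactive right after purchase } j) &= p\,(1 - p)^{j-1}
  && \text{dropout can occur after any purchase} \\
f(p \mid a, b) &= \frac{p^{a-1} (1 - p)^{b-1}}{B(a, b)}
  && \text{dropout probability varies across customers} \\
P\bigl(X(t) = x \mid r, \alpha\bigr) &= \frac{\Gamma(r + x)}{\Gamma(r)\, x!}
  \left(\frac{\alpha}{\alpha + t}\right)^{r} \left(\frac{t}{\alpha + t}\right)^{x}
  && \text{Poisson mixed over Gamma gives the NBD}
\end{align}
```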
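The Gamma-Gamma model is used to estimate the expected monetary value of future customer purchases.
It assumes:
- Monetary value is independent of purchase frequency
- A customer's transaction values follow a Gamma distribution whose rate parameter itself varies across customers according to a Gamma distribution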
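In the usual notation for the Gamma-Gamma model (the symbols are conventional and not taken from this project's code), with $m_x$ the customer's observed average value over $x$ transactions, the assumptions and the resulting expected average transaction value are:

```latex
\begin{align}
z_i \mid \nu &\sim \mathrm{Gamma}(p, \nu)
  && \text{value of transaction } i \text{ (shape } p \text{, rate } \nu \text{)} \\
\nu &\sim \mathrm{Gamma}(q, \gamma)
  && \text{rate varies across customers} \\
E[M \mid p, q, \gamma, m_x, x] &= \frac{(\gamma + m_x x)\, p}{p x + q - 1}
  = \frac{q - 1}{p x + q - 1} \cdot \frac{\gamma p}{q - 1}
  + \frac{p x}{p x + q - 1} \cdot m_x
\end{align}
```

The second form shows that the estimate is a weighted average of the population mean $\gamma p / (q - 1)$ and the customer's own average $m_x$, with more weight on $m_x$ as the number of observed transactions grows. Multiplying this value by the expected number of transactions from BG/NBD gives an undiscounted revenue estimate for the forecast window.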
XGBoost models:
- XGBClassifier predicts whether a customer is likely to make a purchase in the future
- XGBRegressor predicts 3-month CLV based on RFM and engineered features
- SHAP is used to interpret feature contributions in both models
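This README does not say how the training targets were built. One common approach, shown here purely as an assumption, is a time-based split: features come from transactions before a cutoff date, and targets come from the following 90 days. The function name, the `Revenue` column, and the target column names are hypothetical.

```python
import pandas as pd

def build_training_table(tx: pd.DataFrame, cutoff, horizon_days: int = 90) -> pd.DataFrame:
    """RFM features before `cutoff`, spend and purchase targets after it."""
    cutoff = pd.Timestamp(cutoff)
    end = cutoff + pd.Timedelta(days=horizon_days)
    past = tx[tx["InvoiceDate"] < cutoff]
    future = tx[(tx["InvoiceDate"] >= cutoff) & (tx["InvoiceDate"] < end)]

    table = past.groupby("Customer ID").agg(
        recency=("InvoiceDate", lambda d: (cutoff - d.max()).days),
        frequency=("Invoice", "nunique"),
        monetary=("Revenue", "sum"),
    )
    future_spend = future.groupby("Customer ID")["Revenue"].sum()
    # Customers with no purchases in the horizon get a target of 0.
    table["target_spend"] = future_spend.reindex(table.index, fill_value=0.0)
    table["target_purchased"] = (table["target_spend"] > 0).astype(int)
    return table

# Toy transactions (illustrative values only).
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 3],
    "Invoice": ["I1", "I2", "I3", "I4", "I5"],
    "InvoiceDate": pd.to_datetime([
        "2010-08-01", "2010-09-15", "2010-10-20", "2010-07-01", "2010-11-05",
    ]),
    "Revenue": [100.0, 50.0, 70.0, 30.0, 20.0],
})
train = build_training_table(tx, cutoff="2010-10-01")
```

Under this setup, `target_purchased` would be the label for XGBClassifier and `target_spend` the label for XGBRegressor, with the three RFM columns as features. Customer 3 is absent from the table because it has no history before the cutoff.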
Exploratory Data Analysis:
- Revenue trends by month and day
- Top products and customers
Customer Segmentation:
- KMeans clustering on scaled RFM values
- Labeled groups: Champions, Loyal Big Spenders, Regulars, Lost
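Clustering on scaled RFM values can be sketched with a scikit-learn pipeline. This is an illustration on toy data, not the project's code: it uses two clusters for brevity, whereas the four labeled groups above suggest the project used four, and the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy RFM rows: [recency_days, frequency, monetary] (illustrative values only).
rfm_values = np.array([
    [5, 20, 2500.0],
    [8, 18, 2200.0],
    [10, 15, 1900.0],
    [200, 1, 40.0],
    [250, 2, 60.0],
    [300, 1, 25.0],
])

# Scaling first keeps the monetary column from dominating the distances.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(rfm_values)

# Per-cluster means in original units help when naming the clusters.
cluster_means = {
    int(k): rfm_values[labels == k].mean(axis=0) for k in np.unique(labels)
}
```

The cluster ids themselves are arbitrary. Names such as Champions or Lost are assigned afterwards by inspecting each cluster's mean recency, frequency, and monetary value.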
CLV Estimation:
- BG/NBD & Gamma-Gamma models for probabilistic lifetime value
- 3-month CLV and expected profit prediction
Predictive Modeling:
- XGBClassifier for future purchase probability
- XGBRegressor for customer-level CLV
- SHAP plots for feature contribution
Visual Insights:
- Pie charts of segment distribution
- Revenue trend lines
- Top products and customer breakdown
Predicted 3-month CLV was used to:
- Rank customers by expected future value
- Support targeted marketing and loyalty prioritization
- Flag low-CLV or negative-profit customers for churn investigation
- Train supervised ML models for predicting future spend and activity
XGBClassifier:
- Accuracy
- Precision, Recall, F1-score
- Confusion Matrix
XGBRegressor:
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R² Score
These metrics were used to tune the models and validate their predictive performance.
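To make the definitions concrete, the metrics listed above can be computed by hand as in the sketch below. It uses only the standard library and toy values. In practice, `sklearn.metrics` provides equivalents such as `f1_score`, `mean_absolute_error`, and `r2_score`.

```python
import math

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1 and confusion counts for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion": {"tp": tp, "tn": tn, "fp": fp, "fn": fn},
    }

def regression_metrics(y_true, y_pred):
    """RMSE, MAE and R^2 for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

# Toy labels and predictions (illustrative values only).
clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
reg = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
```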
Tools and libraries:
- Python (pandas, NumPy, Seaborn, Matplotlib)
- Lifetimes (BG/NBD and Gamma-Gamma models)
- scikit-learn (KMeans, pipelines, scaling)
- XGBoost
- SHAP for model explainability
- Jupyter Notebook for the analysis workflow
An interactive Streamlit app was developed for real-time insights and visualizations.
To run it locally:
- Clone the repository:
  git clone https://github.com/soromm.git
- Install dependencies:
  pip install -r requirements.txt
- Run the Streamlit app:
  streamlit run app.py
The app allows you to:
- Load and preprocess your dataset
- Perform EDA
- Execute segmentation
- Calculate CLV (BG/NBD + Gamma-Gamma)
- Predict model-based CLV
- Visualize customer clusters and lifetime value
The analysis yields:
- Segment distribution for actionable marketing targeting
- Top products by sales and quantity for inventory planning
- Customer-level CLV for personalized engagement strategies
- Insights on negative CLV customers for churn investigation
Planned improvements:
- Add time-decay adjusted CLV predictions
- Incorporate churn probability into long-term value forecasts
- Explore deep learning sequence models for purchase prediction
- Extend to multi-country analysis or product-level segmentation
Contributions are welcome! Feel free to:
- Raise issues for enhancements or bugs
- Suggest improvements or new features
This project is licensed under the MIT License. See the LICENSE file for details.