This project leverages RFM analysis, KMeans clustering, and probabilistic models (BG/NBD and Gamma-Gamma) to segment customers and estimate Customer Lifetime Value (CLV) using the Online Retail II dataset. It also integrates XGBoost models to predict future purchasing behavior and CLV, with interactive visualizations via a Streamlit dashboard.

Soromm/Retail-customer

Retail Customer Segmentation and Lifetime Value Prediction

This project performs retail customer segmentation and customer lifetime value (CLV) prediction using Online Retail II data to support targeted marketing, retention strategy, and revenue forecasting.

It leverages:

  • RFM analysis + KMeans clustering for customer segmentation
  • BG/NBD & Gamma-Gamma models for CLV estimation
  • XGBoost classifiers and regressors for predictive modeling

Business Context

Retail businesses often face challenges in understanding which customers are most valuable and how to retain them. This project helps solve that by:

  • Identifying customer segments for targeted marketing
  • Estimating future value of customers to prioritize high-ROI efforts
  • Supporting better decisions for loyalty campaigns, inventory, and engagement

Project Goals

  • Segment customers using recency, frequency, and monetary (RFM) analysis
  • Predict future customer CLV for better targeted marketing
  • Identify high-value customers for prioritization
  • Visualize insights for stakeholders and management teams

Dataset Description

The dataset has 8 columns. Quantity and Price are numeric, InvoiceDate is a timestamp, and the remaining columns are identifiers or text.

  • Invoice – Invoice number
  • StockCode – Unique product code
  • Description – Product description
  • Quantity – Number of units of the product on the invoice line
  • InvoiceDate – Date and time of the purchase
  • Price – Unit price of the product
  • Customer ID – Unique customer identifier
  • Country – Country where the customer resides

RFM Segmentation

RFM segmentation was applied to classify customers based on their transactional patterns. Each customer received a score across Recency, Frequency, and Monetary value. These scores were used to assign them into behavioral segments such as Champions, Loyal Customers, Potential Loyalists, At Risk, and Lost.

This enabled:

  • Prioritization of high-value customers for loyalty and upsell strategies
  • Identification of at-risk or inactive customers for re-engagement campaigns
  • Foundational grouping for downstream CLV modeling and predictive analysis
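
As a rough illustration of the scoring step, the sketch below computes Recency, Frequency, and Monetary values from a handful of made-up transactions and maps score combinations to segment names. It uses only the Python standard library so that it runs anywhere. The tuple layout, the helper names (`rfm_table`, `rank_score`, `segment`, `assign_segments`), the 1 to 3 score scale, and the segment thresholds are all assumptions for illustration, not the rules used in this project.

```python
from collections import defaultdict
from datetime import date

# Hypothetical transactions: (customer_id, invoice, invoice_date, quantity, price)
TRANSACTIONS = [
    (1, "A1", date(2011, 11, 28), 10, 5.0),
    (1, "A2", date(2011, 12, 5), 4, 12.5),
    (1, "A3", date(2011, 12, 8), 6, 10.0),
    (2, "B1", date(2011, 6, 1), 2, 3.0),
    (3, "C1", date(2011, 10, 15), 5, 4.0),
    (3, "C2", date(2011, 11, 20), 1, 20.0),
]


def rfm_table(transactions, snapshot):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    invoices = defaultdict(set)
    spend = defaultdict(float)
    for cust, invoice, day, qty, price in transactions:
        last_purchase[cust] = max(last_purchase.get(cust, day), day)
        invoices[cust].add(invoice)       # frequency counts distinct invoices
        spend[cust] += qty * price        # monetary is total line revenue
    return {
        c: ((snapshot - last_purchase[c]).days, len(invoices[c]), spend[c])
        for c in last_purchase
    }


def rank_score(values, bins=3, reverse=False):
    """Map each customer to a 1..bins score by rank (higher score is better).

    Ties are broken by input order, which is good enough for a sketch.
    """
    ordered = sorted(values, key=values.get, reverse=reverse)
    n = len(ordered)
    return {c: 1 + (i * bins) // n for i, c in enumerate(ordered)}


def segment(r, f):
    """Assumed mapping from recency and frequency scores to a segment name."""
    if r == 3 and f == 3:
        return "Champions"
    if r >= 2 and f >= 2:
        return "Loyal Customers"
    if r >= 2:
        return "Potential Loyalists"
    if f >= 2:
        return "At Risk"
    return "Lost"


def assign_segments(rfm):
    # A low recency is good, so the largest recency gets the lowest score.
    r = rank_score({c: v[0] for c, v in rfm.items()}, reverse=True)
    f = rank_score({c: v[1] for c, v in rfm.items()})
    return {c: segment(r[c], f[c]) for c in rfm}


rfm = rfm_table(TRANSACTIONS, snapshot=date(2011, 12, 10))
segments = assign_segments(rfm)
```

A monetary score can be computed with the same `rank_score` helper; it is left out of the segment mapping here only to keep the example short.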

Modeling Approaches

📈 BG/NBD Model (Beta-Geometric/Negative Binomial Distribution)

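Used to predict the expected number of future transactions for each customer.
It models:

  • Purchases while a customer is active, as a Poisson process whose rate varies across customers following a Gamma distribution (together, a Negative Binomial distribution)
  • Dropout (churn) after each purchase, with a dropout probability that varies across customers following a Beta distribution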
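
To make the model more concrete, the sketch below evaluates the closed-form BG/NBD probability that a customer is still active, given `x` repeat purchases, the time `t_x` of the most recent one, and the length `T` of the observation period. Here `r` and `alpha` are the Gamma parameters of the purchase rate, and `a` and `b` are the Beta parameters of the dropout probability. The parameter values are made up for illustration; in practice they would come from a fitted model (for example via the Lifetimes package), not from this snippet.

```python
def p_alive(x, t_x, T, r, alpha, a, b):
    """P(customer still active | x repeat purchases, last at t_x, observed for T)."""
    if x == 0:
        # In BG/NBD, dropout can only happen right after a repeat purchase,
        # so a customer with no repeat purchases is treated as still active.
        return 1.0
    ratio = (alpha + T) / (alpha + t_x)
    return 1.0 / (1.0 + (a / (b + x - 1)) * ratio ** (r + x))


# Made-up parameters; time is measured in weeks in this example.
params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)
recent_buyer = p_alive(x=2, t_x=25, T=30, **params)
lapsed_buyer = p_alive(x=2, t_x=10, T=30, **params)
```

With the same purchase count, the customer whose last purchase was longer ago gets a lower probability of being active, which is the behavior the model is meant to capture.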

💰 Gamma-Gamma Model

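Used to estimate the expected average monetary value per transaction for each customer.
Assumes:

  • A customer's average transaction value is independent of their purchase frequency
  • Individual transaction values follow a Gamma distribution whose scale varies across customers following a second Gamma distribution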
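
Under these assumptions, the expected average transaction value for a customer is a weighted average of the customer's own observed mean and the population mean, with more weight on the customer's own history as the number of transactions grows. A minimal sketch of that conditional expectation follows. The parameter names `p`, `q`, and `v` follow the usual Gamma-Gamma notation, and the values passed in are made up for illustration, not fitted to this dataset.

```python
def expected_avg_value(x, m_x, p, q, v):
    """Gamma-Gamma conditional expectation of a customer's mean transaction value.

    x       : number of observed transactions
    m_x     : the customer's observed average transaction value
    p, q, v : model parameters (q must exceed 1 for the population mean to exist)
    """
    if q <= 1:
        raise ValueError("q must be greater than 1")
    population_mean = p * v / (q - 1)
    weight = p * x / (p * x + q - 1)  # weight on the customer's own history
    return (1 - weight) * population_mean + weight * m_x


# Made-up parameters: the population mean is 6 * 15 / (4 - 1) = 30.
new_customer = expected_avg_value(x=0, m_x=0.0, p=6.0, q=4.0, v=15.0)
repeat_customer = expected_avg_value(x=2, m_x=50.0, p=6.0, q=4.0, v=15.0)
```

A customer with no history is assigned the population mean, while a customer with two transactions averaging 50 is pulled most of the way toward their own average.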

🤖 XGBoost Classifier and Regressor

  • XGBClassifier predicts whether a customer is likely to make a purchase in the future
  • XGBRegressor predicts 3-month CLV based on RFM and engineered features
  • Both models use SHAP for interpreting feature importance
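
Training models of this kind usually requires splitting each customer's history in time: features come from purchases up to a cutoff date, and the targets record what the customer did in the window after it. The sketch below builds such rows using only the standard library. The cutoff date, the 90-day horizon, the feature set, and the helper name `build_training_rows` are illustrative assumptions, not the project's actual feature pipeline.

```python
from datetime import date, timedelta

# Hypothetical purchase log: (customer_id, purchase_date, amount)
PURCHASES = [
    (1, date(2011, 7, 1), 40.0),
    (1, date(2011, 8, 20), 60.0),
    (1, date(2011, 10, 5), 30.0),
    (2, date(2011, 6, 10), 25.0),
    (3, date(2011, 8, 30), 80.0),
    (3, date(2011, 12, 1), 10.0),   # falls after the 90-day horizon
    (4, date(2011, 9, 15), 55.0),   # first seen after the cutoff
]


def build_training_rows(purchases, cutoff, horizon_days=90):
    """Return {customer_id: features and targets} split at the cutoff date."""
    horizon_end = cutoff + timedelta(days=horizon_days)
    history, future_spend = {}, {}
    for cust, day, amount in purchases:
        if day <= cutoff:
            history.setdefault(cust, []).append((day, amount))
        elif day <= horizon_end:
            future_spend[cust] = future_spend.get(cust, 0.0) + amount

    rows = {}
    # Only customers with history before the cutoff have features to learn from.
    for cust, events in history.items():
        last_day = max(d for d, _ in events)
        rows[cust] = {
            "recency_days": (cutoff - last_day).days,
            "frequency": len(events),
            "monetary": sum(a for _, a in events),
            "purchased_again": int(cust in future_spend),  # classifier target
            "future_spend": future_spend.get(cust, 0.0),   # regressor target
        }
    return rows


rows = build_training_rows(PURCHASES, cutoff=date(2011, 9, 1))
```

Keeping features strictly before the cutoff prevents information from the prediction window leaking into the inputs.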

Dataset Link


Key Features

Exploratory Data Analysis:

  • Revenue trends by month and day
  • Top products and customers

Customer Segmentation:

  • KMeans clustering on scaled RFM values
  • Labeled groups: Champions, Loyal Big Spenders, Regulars, Lost
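
Scaling matters here because KMeans relies on Euclidean distance: without it, monetary value (often in the hundreds or thousands) would dominate recency and frequency. The sketch below standardizes each RFM column to zero mean and unit variance and then runs a few iterations of Lloyd's algorithm, written out in plain Python for illustration. The sample values, the choice of two clusters, and the deterministic initial centroids are assumptions; the project itself relies on Scikit-Learn for KMeans and scaling.

```python
from statistics import mean, pstdev

# Hypothetical (recency_days, frequency, monetary) rows
RFM = [
    (5, 12, 2400.0), (9, 10, 1900.0), (7, 11, 2100.0),
    (200, 1, 40.0), (180, 2, 75.0), (220, 1, 30.0),
]


def standardize(rows):
    """Scale every column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sds)) for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm: assign to nearest centroid, then recompute means."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[k] = tuple(mean(c) for c in zip(*members))
    return labels


scaled = standardize(RFM)
labels = kmeans(scaled, centroids=[scaled[0], scaled[3]])
```

On this toy data the recent, frequent, high-spend customers land in one cluster and the lapsed, low-spend customers in the other; naming the clusters (for example "Champions" or "Lost") is a manual step based on each cluster's average RFM values.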

CLV Estimation:

  • BG/NBD & Gamma-Gamma models for probabilistic lifetime value
  • 3-month CLV and expected profit prediction
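
A simplified way to see how the two models combine: the 3-month CLV is roughly the expected number of purchases in the next three months (from BG/NBD) multiplied by the expected average transaction value (from Gamma-Gamma), and expected profit applies a margin to that figure. The sketch below ignores discounting, and the 5% margin and the input numbers are placeholders, not values from this project.

```python
def three_month_clv(expected_purchases, expected_avg_value):
    """Undiscounted CLV: expected purchase count times expected value per purchase."""
    return expected_purchases * expected_avg_value


def expected_profit(clv, margin=0.05):
    """Apply an assumed profit margin to the CLV estimate."""
    return clv * margin


clv = three_month_clv(expected_purchases=2.5, expected_avg_value=40.0)
profit = expected_profit(clv)
```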

Predictive Modeling:

  • XGBClassifier for future purchase probability
  • XGBRegressor for customer-level CLV
  • SHAP plots for feature contribution

Visual Insights:

  • Pie charts of segment distribution
  • Revenue trend lines
  • Top products and customer breakdown

CLV Use Case

Predicted 3-month CLV was used to:

  • Rank customers by expected future value
  • Support targeted marketing and loyalty prioritization
  • Flag low-CLV or negative-profit customers for churn investigation
  • Train supervised ML models for predicting future spend and activity

Model Evaluation

XGBClassifier

  • Accuracy
  • Precision, Recall, F1-score
  • Confusion Matrix

XGBRegressor

  • Root Mean Squared Error (RMSE)
  • Mean Absolute Error (MAE)
  • R² Score

Evaluation metrics were used to tune models and validate predictive performance.
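
For reference, the metrics listed above can be computed directly from predictions. The sketch below implements them in plain Python on made-up labels and values so that the definitions are explicit. In practice the equivalent Scikit-Learn functions would normally be used, and the function names here are illustrative.

```python
from math import sqrt


def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and confusion matrix for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Rows are true classes, columns are predicted classes, ordered 0 then 1.
        "confusion_matrix": [[tn, fp], [fn, tp]],
    }


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R² for numeric predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R² is undefined when every true value is identical.
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
    }


clf = classification_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
reg = regression_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
```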


Technologies Used

  • Python (Pandas, NumPy, Seaborn, Matplotlib)
  • Lifetimes
  • Scikit-Learn (KMeans, Pipelines, Scaling)
  • XGBoost
  • SHAP for explainable AI
  • Jupyter Notebook for analysis workflow

Streamlit Dashboard

An interactive Streamlit app was developed for real-time insights and visualizations.

How to Run the App

  1. Clone the repository:
    git clone https://github.com/Soromm/Retail-customer.git
    cd Retail-customer

  2. Install dependencies:
    pip install -r requirements.txt

  3. Run the Streamlit app:
    streamlit run app.py

The app allows you to:

  • Load and preprocess your dataset
  • Perform EDA
  • Execute segmentation
  • Calculate CLV (BG/NBD + Gamma-Gamma)
  • Predict model-based CLV
  • Visualize customer clusters and lifetime value

Results Interpretation

  • Segment distribution for actionable marketing targeting
  • Top products by sales and quantity for inventory planning
  • Customer-level CLV for personalized engagement strategies
  • Insights on negative CLV customers for churn investigation

Future Work

  • Add time-decay adjusted CLV predictions
  • Incorporate churn probability into long-term value forecasts
  • Explore deep learning sequence models for purchase prediction
  • Extend to multi-country analysis or product-level segmentation

Contributing

Contributions are welcome! Feel free to:

  • Raise issues for enhancements or bugs
  • Suggest improvements or new features

License

This project is licensed under the MIT License. See the LICENSE file for details.
