The global e-commerce industry generates vast amounts of transaction data daily, offering valuable insights into customer purchasing behaviors. Analyzing this data is essential for identifying meaningful customer segments and recommending relevant products to enhance customer experience and drive business growth. This project aims to examine transaction data from an online retail business to uncover patterns in customer purchase behavior, segment customers based on Recency, Frequency, and Monetary (RFM) analysis, and develop a product recommendation system using collaborative filtering techniques.

Live Streamlit demo: https://shopper-spectrum-segmentation-and-recomm-vjm2xqsxru5selsagyxwj.streamlit.app/
- Public Dataset Exploration and Preprocessing
- Data Cleaning and Feature Engineering
- Exploratory Data Analysis (EDA)
- Clustering Techniques
- Collaborative Filtering-based Product Recommendation
- Model Evaluation and Customer Segmentation Interpretation
- Streamlit
E-Commerce and Retail Analytics
The core problem addressed is to leverage e-commerce transaction data to understand customer purchasing behaviors, segment customers effectively using RFM analysis, and build a robust product recommendation system. This approach aims to enhance customer experience and drive business growth through targeted marketing and personalized product suggestions.
Dataset: https://drive.google.com/file/d/1rzRwxm_CJxcRzfoo9Ix37A2JTlMummY-/view?usp=drive_link
- Customer Segmentation for Targeted Marketing Campaigns: Identify distinct customer groups for personalized marketing strategies.
- Personalized Product Recommendations on E-Commerce Platforms: Offer relevant product suggestions to individual customers, boosting sales and engagement.
- Identifying At-Risk Customers for Retention Programs: Proactively identify and engage customers showing signs of churn.
- Dynamic Pricing Strategies Based on Purchase Behavior: Adjust product prices based on customer segment and purchasing patterns.
- Inventory Management and Stock Optimization Based on Customer Demand Patterns: Optimize stock levels by forecasting demand based on customer segments.
- Unsupervised Machine Learning – Clustering
- Collaborative Filtering – Recommendation System
Metric | Value |
---|---|
Total Transactions | ~541,909 |
Unique Products | ~4,000+ |
Unique Customers | ~38,000 |
Transaction Period | Dec 2022 – Dec 2023 |
Countries Represented | ~37 |
Missing Customer IDs | ~24.9% of rows filtered out |
Action | Count Removed |
---|---|
Rows with Missing CustomerID | ~135,000+ |
Cancelled Invoices (InvoiceNo "C") | ~9,600+ |
Negative/Zero Quantity or Price | ~17,000+ |
RFM Metric | Min | Max | Mean |
---|---|---|---|
Recency | 1 | 373 | ~92.6 |
Frequency | 1 | 209 | ~4.4 |
Monetary (£) | 3.75 | 28000+ | ~440.3 |
Metric | Value |
---|---|
Algorithm | KMeans (Scikit-learn) |
Features Used | RFM (scaled) |
Optimal Clusters | 4 |
Silhouette Score | ~0.46 |
Cluster Labels | High-Value, Regular, Occasional, At-Risk |
Cluster # | Segment Label | % of Customers |
---|---|---|
0 | High-Value | ~8–10% |
1 | Regular | ~30% |
2 | Occasional | ~45% |
3 | At-Risk | ~15% |
Metric | Value |
---|---|
Technique | Item-based Collaborative Filtering |
Similarity Metric | Cosine Similarity |
Recommendations per Product | Top 5 |
Matrix Shape | ~38,000 (Customers) x ~3,900 (Products) |
Average Products Purchased | ~7–10 per customer |
Module | Features |
---|---|
Product Recommender | Input product → Recommends 5 similar items |
Customer Segmentation | Input RFM values → Predicts customer segment |
Backend Models | kmeans_model.joblib , product_matrix.pkl |
Frontend Tool | Built with Streamlit |
- Dataset: use the online retail transaction dataset linked above.
- Explore the dataset to understand the structure and data types.
- Identify missing values, duplicates, and unusual records.
Column | Description |
---|---|
InvoiceNo | Transaction number |
StockCode | Unique product/item code |
Description | Name of the product |
Quantity | Number of products purchased |
InvoiceDate | Date and time of transaction (2022–2023) |
UnitPrice | Price per product |
CustomerID | Unique identifier for each customer |
Country | Country where the customer is based |
- Remove rows with missing `CustomerID`.
- Exclude cancelled invoices (`InvoiceNo` starting with 'C').
- Remove negative or zero quantities and prices.
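
The three cleaning rules above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual notebook code: the helper name `clean_transactions` and the sample rows are hypothetical, and only the column names come from the dataset description.

```python
import pandas as pd


def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning rules to a raw transaction frame (hypothetical helper)."""
    # 1. Drop rows without a customer identifier.
    cleaned = df.dropna(subset=["CustomerID"])

    # 2. Drop cancelled invoices, whose InvoiceNo starts with "C".
    is_cancelled = cleaned["InvoiceNo"].astype(str).str.startswith("C")
    cleaned = cleaned[~is_cancelled]

    # 3. Keep only strictly positive quantities and prices.
    cleaned = cleaned[(cleaned["Quantity"] > 0) & (cleaned["UnitPrice"] > 0)]

    return cleaned.reset_index(drop=True)


# Made-up sample rows for illustration; not taken from the dataset.
raw = pd.DataFrame(
    {
        "InvoiceNo": ["10001", "C10002", "10003", "10004", "10005"],
        "StockCode": ["A1", "A1", "B2", "C3", "D4"],
        "Quantity": [2, -2, 1, 0, 3],
        "UnitPrice": [2.5, 2.5, 4.0, 1.0, 0.0],
        "CustomerID": [1.0, 1.0, None, 2.0, 3.0],
    }
)
clean = clean_transactions(raw)
print(clean)
```

Applying the filters in this order means each rule can be counted separately, which is one way to arrive at per-rule removal counts like those reported in the cleaning table.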
- Analyze transaction volume by country.
- Identify top-selling products.
- Visualize purchase trends over time.
- Inspect monetary distribution per transaction and customer.
- RFM distributions.
- Elbow curve for cluster selection.
- Customer cluster profiles.
- Product recommendation heatmap / similarity matrix.
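
Several of the EDA items above reduce to simple pandas aggregations. The sketch below uses made-up rows and a derived `Revenue` column (an assumption, computed as `Quantity * UnitPrice`); each resulting Series could then be plotted with any charting library.

```python
import pandas as pd

# Made-up rows for illustration only.
tx = pd.DataFrame(
    {
        "InvoiceNo": ["1", "1", "2", "3", "4"],
        "Description": ["MUG", "PEN", "MUG", "BAG", "MUG"],
        "Quantity": [2, 10, 1, 3, 4],
        "UnitPrice": [5.0, 1.0, 5.0, 12.0, 5.0],
        "InvoiceDate": pd.to_datetime(
            ["2023-01-05", "2023-01-05", "2023-01-20", "2023-02-02", "2023-02-15"]
        ),
        "Country": ["United Kingdom", "United Kingdom", "France", "France", "Germany"],
    }
)
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Transaction volume by country: number of distinct invoices per country.
invoices_by_country = (
    tx.groupby("Country")["InvoiceNo"].nunique().sort_values(ascending=False)
)

# Top-selling products by units sold.
top_products = (
    tx.groupby("Description")["Quantity"].sum().sort_values(ascending=False)
)

# Purchase trend over time: revenue per calendar month.
monthly_revenue = tx.groupby(tx["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()

# Monetary distribution per transaction: total value of each invoice.
invoice_totals = tx.groupby("InvoiceNo")["Revenue"].sum()

print(invoices_by_country, top_products, monthly_revenue, invoice_totals, sep="\n\n")
```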
- Feature Engineering:
  - Calculate Recency = Latest purchase date in dataset - Customer's last purchase date
  - Calculate Frequency = Number of transactions per customer
  - Calculate Monetary = Total amount spent by customer
- Standardize/Normalize the RFM values.
- Choose Clustering Algorithm (KMeans, DBSCAN, Hierarchical, etc.).
- Use Elbow Method and Silhouette Score to decide the number of clusters.
- Run Clustering.
- Label the clusters by interpreting their RFM averages:

  Cluster | Characteristics | Segment Label |
  ---|---|---|
  Low R (recent), High F, High M | Regular, frequent, recent, big spenders | High-Value |
  Medium F, Medium M | Steady purchasers but not premium | Regular |
  Low F, Low M, older R | Rare, occasional purchases | Occasional |
  High R, Low F, Low M | Haven't purchased in a long time | At-Risk |

- Visualize the clusters using a scatter plot or 3D plot of RFM scores.
- Save the best performing model for Streamlit usage.
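
The clustering steps above could look roughly like the following sketch. It assumes cleaned transactions with the dataset's column names, uses randomly generated transactions purely for illustration, and picks the cluster count by highest silhouette score (one possible rule; the elbow on `inertia` is usually read by eye). The helper name `build_rfm` and the search range of 2 to 6 clusters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_rfm(tx: pd.DataFrame) -> pd.DataFrame:
    """One Recency/Frequency/Monetary row per customer (hypothetical helper)."""
    tx = tx.assign(Amount=tx["Quantity"] * tx["UnitPrice"])
    latest = tx["InvoiceDate"].max()  # latest purchase date in the dataset
    grouped = tx.groupby("CustomerID")
    return pd.DataFrame(
        {
            "Recency": (latest - grouped["InvoiceDate"].max()).dt.days,
            "Frequency": grouped["InvoiceNo"].nunique(),
            "Monetary": grouped["Amount"].sum(),
        }
    )


# Randomly generated transactions, for illustration only.
rng = np.random.default_rng(0)
n = 600
tx = pd.DataFrame(
    {
        "CustomerID": rng.integers(1, 61, size=n),
        "InvoiceNo": np.arange(n).astype(str),
        "InvoiceDate": pd.Timestamp("2023-01-01")
        + pd.to_timedelta(rng.integers(0, 365, size=n), unit="D"),
        "Quantity": rng.integers(1, 10, size=n),
        "UnitPrice": rng.uniform(1, 20, size=n).round(2),
    }
)
rfm = build_rfm(tx)

# Standardize so that no single RFM dimension dominates the distance metric.
scaler = StandardScaler()
X = scaler.fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Record inertia (for the elbow curve) and silhouette score for each k.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = {
        "inertia": model.inertia_,
        "silhouette": silhouette_score(X, model.labels_),
    }

# Assumed selection rule: the k with the highest silhouette score.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
rfm["Cluster"] = kmeans.labels_

# Per-cluster RFM averages, used to assign segment labels by inspection.
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean()
print(pd.DataFrame(scores).T)
print(profile)
```

On random data like this the clusters carry no business meaning; with real transactions, the `profile` table is what would be compared against the labeling rules in the table above.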
- Use Item-based Collaborative Filtering.
- Compute cosine similarity (or another similarity metric) between products based on purchase history (`CustomerID–StockCode` matrix).
- Return top 5 similar products to the entered product name.
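
A minimal sketch of item-based collaborative filtering along these lines is given below. The purchase rows are made up, the function name `recommend` is hypothetical, and products are keyed directly by `StockCode`; mapping an entered product name (`Description`) to its `StockCode` is left out for brevity.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up purchase history for illustration only.
purchases = pd.DataFrame(
    {
        "CustomerID": [1, 1, 2, 2, 3, 3, 4],
        "StockCode": ["MUG", "PLATE", "MUG", "PLATE", "MUG", "LAMP", "LAMP"],
        "Quantity": [2, 1, 1, 3, 1, 1, 2],
    }
)

# Customer x product matrix: rows are customers, columns are products.
matrix = purchases.pivot_table(
    index="CustomerID",
    columns="StockCode",
    values="Quantity",
    aggfunc="sum",
    fill_value=0,
)

# Item-item similarity: transpose so each product becomes a row vector.
similarity = pd.DataFrame(
    cosine_similarity(matrix.T),
    index=matrix.columns,
    columns=matrix.columns,
)


def recommend(product: str, similarity: pd.DataFrame, top_n: int = 5) -> list:
    """Return up to top_n products most similar to `product` (hypothetical helper)."""
    if product not in similarity.index:
        return []  # unknown product: nothing to recommend
    ranked = similarity[product].drop(product).sort_values(ascending=False)
    return ranked.head(top_n).index.tolist()


print(recommend("MUG", similarity))
```

In this toy data, customers who buy `MUG` mostly also buy `PLATE`, so `PLATE` ranks ahead of `LAMP` for `MUG`. With only three products the list is shorter than five; on the full matrix, `top_n=5` would return five items.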
Objective: When a user inputs a product name, the app recommends 5 similar products based on collaborative filtering.
Functionality:
- Text input box for Product Name
- Button: `Get Recommendations`
- Display 5 recommended products as a styled list or card view
Functionality:
- 3 number inputs for:
  - Recency (in days)
  - Frequency (number of purchases)
  - Monetary (total spend)
- Button: `Predict Cluster`
- Display: Cluster label (e.g., High-Value, Regular, Occasional, At-Risk)
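
The prediction behind the `Predict Cluster` button might be structured as in the sketch below. Everything here is an assumption for illustration: the toy RFM rows, the `predict_segment` helper, the bundle layout, and the rule that names clusters by ranking their average recency. KMeans cluster numbers are arbitrary, so a real app would use whatever mapping was fixed when its model was trained.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy RFM rows (Recency in days, Frequency, Monetary); illustrative values only.
rfm = np.array(
    [
        [5, 40, 5000.0],
        [8, 35, 4200.0],
        [30, 10, 900.0],
        [35, 12, 1100.0],
        [120, 3, 150.0],
        [140, 2, 120.0],
        [320, 1, 40.0],
        [350, 1, 30.0],
    ]
)
scaler = StandardScaler().fit(rfm)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaler.transform(rfm))

# Assumed naming rule: order clusters from most to least recent average purchase.
centers = scaler.inverse_transform(kmeans.cluster_centers_)
order = np.argsort(centers[:, 0])  # column 0 is Recency
names = ["High-Value", "Regular", "Occasional", "At-Risk"]
segment_names = {int(cluster): names[rank] for rank, cluster in enumerate(order)}

# Persist scaler, model and label mapping together, then reload as the app would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "kmeans_model.joblib")
    joblib.dump(
        {"scaler": scaler, "kmeans": kmeans, "segment_names": segment_names}, path
    )
    bundle = joblib.load(path)


def predict_segment(recency: float, frequency: float, monetary: float, bundle: dict) -> str:
    """Scale raw RFM inputs exactly as in training, then map the cluster to a label."""
    scaled = bundle["scaler"].transform([[recency, frequency, monetary]])
    cluster = int(bundle["kmeans"].predict(scaled)[0])
    return bundle["segment_names"][cluster]


print(predict_segment(6, 38, 4500.0, bundle))
print(predict_segment(340, 1, 35.0, bundle))
```

In a Streamlit page, the three values would come from `st.number_input` widgets and the call would sit behind `st.button("Predict Cluster")`. Saving the scaler alongside the model matters because the model was trained on scaled features, so raw inputs must go through the same transformation.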
`Pandas`, `Numpy`, `DataCleaning`, `FeatureEngineering`, `EDA`, `RFMAnalysis`, `CustomerSegmentation`, `KMeansClustering`, `CollaborativeFiltering`, `CosineSimilarity`, `ProductRecommendation`, `ScikitLearn`, `StandardScaler`, `StreamlitApp`, `MachineLearning`, `DataVisualization`, `PivotTables`, `DataTransformation`, `RealTimePrediction`
- Python Notebook with:
  - Clean, well-documented code with comments.
  - Visualizations for EDA and clustering insights.
  - RFM-based customer segmentation and product similarity analysis.
  - Model evaluations for clustering (like inertia, silhouette score).
- Streamlit Web Application:
  - User input for a product name → recommends 5 similar products.
  - Customer behavior input (Recency, Frequency, Monetary) → predicts cluster segment.
  - Clean, interactive UI with real-time outputs.
Contact: Rahul Rai | rahulraimau5@gmail.com | GitHub | LinkedIn