Check out the live app (deployed on Hugging Face!)
A machine learning-powered Streamlit app that predicts the probability of a product being returned based on customer reviews, delivery metadata, and review ratings.
Product returns in e-commerce lead to significant losses. This project applies machine learning and natural language processing to predict the return likelihood of a product using customer reviews, metadata, and delivery data.
Built with:
- Python and Streamlit for the UI
- Scikit-learn and XGBoost for machine learning
- TextBlob for sentiment analysis
- Plotly and Seaborn for visualization
Feature | Description |
---|---|
Sentiment Analysis | Uses TextBlob to analyze the tone of customer reviews |
Delivery Time Impact | Evaluates how delivery duration affects return chances |
Rating Integration | Leverages 1–5 star ratings to gauge satisfaction |
Helpfulness Ratio | Measures how helpful other users found the review |
Category Encoding | Simulated product category derived from ProductId |
Multiple ML Models | Compare predictions using Logistic Regression, Random Forest, and XGBoost |
Model Insights | Learn how each model works and what features it relies on |
The project uses the Amazon Product Reviews dataset from Kaggle. Core columns include:
Column | Description |
---|---|
Text | Full customer review |
Score | Star rating (1 to 5) |
HelpfulnessNumerator | Number of users who found it helpful |
HelpfulnessDenominator | Total users who voted |
ProductId, UserId | Product and user identifiers |
Time | Review timestamp (Unix format) |
Additional engineered features:
- delivery_time: Simulated shipping duration
- category_encoded: Encoded first character of ProductId
- review_polarity: Sentiment score using TextBlob
- review_length: Character count of the review
- helpfulness_ratio: Calculated as HelpfulnessNumerator / HelpfulnessDenominator (set to 0 when denominator is 0)
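The engineered features above can be sketched for a single review as follows. This is a minimal illustration, not the app's actual code: the function names `engineer_features` and `textblob_polarity` are hypothetical, and using `ord()` on the first character of ProductId is only one possible encoding, since the exact scheme is not specified here. delivery_time is left out because the way it is simulated is not described.

```python
from typing import Callable, Dict


def textblob_polarity(text: str) -> float:
    """Sentiment polarity in [-1.0, 1.0], computed with TextBlob."""
    # Imported lazily so the other features can be computed without TextBlob installed.
    from textblob import TextBlob

    return TextBlob(text).sentiment.polarity


def engineer_features(
    row: Dict,
    polarity_fn: Callable[[str], float] = textblob_polarity,
) -> Dict:
    """Derive the engineered features from one raw review row (hypothetical helper)."""
    text = row["Text"]
    numerator = row["HelpfulnessNumerator"]
    denominator = row["HelpfulnessDenominator"]
    return {
        "review_polarity": polarity_fn(text),
        # Character count of the review text.
        "review_length": len(text),
        # Set to 0 when nobody voted, which avoids a division by zero.
        "helpfulness_ratio": numerator / denominator if denominator else 0.0,
        # Assumption: the first character is encoded as its code point;
        # a label encoder over all observed first characters would also fit.
        "category_encoded": ord(row["ProductId"][0]),
    }
```

Passing a different `polarity_fn` (for example a stub that returns a constant) makes it possible to check the other features without installing TextBlob.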
Model | Advantages | Use Case |
---|---|---|
Logistic Regression | Fast and interpretable | Baseline modeling |
Random Forest | Handles nonlinear relationships, less prone to overfitting | General tabular problems |
XGBoost | High accuracy, scalable, feature-aware | Preferred for structured feature data |
Model | Accuracy | AUC Score |
---|---|---|
Logistic Regression | 0.79 | 0.72 |
Random Forest | 0.84 | 0.81 |
XGBoost | 0.87 | 0.89 |
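For readers unfamiliar with the two metrics in the table, here is a minimal plain-Python sketch of how accuracy and AUC are defined. It is an illustration only: how the figures above were actually computed is not stated, the 0.5 decision threshold is an assumption, and the label convention (1 = returned, 0 = not returned) is assumed as well. In practice, scikit-learn's `accuracy_score` and `roc_auc_score` compute the same quantities.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction matches the label."""
    # Assumption: a probability at or above the threshold counts as "returned" (1).
    preds = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for pred, label in zip(preds, y_true) if pred == label)
    return correct / len(y_true)


def roc_auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [p for p, label in zip(y_prob, y_true) if label == 1]
    neg = [p for p, label in zip(y_prob, y_true) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Accuracy depends on the chosen threshold, while AUC only depends on how the model ranks examples, which is why the two columns can differ for the same model.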
pip install -r requirements.txt
streamlit run app.py
Thank you!