A comprehensive machine learning project that predicts student math scores using demographic and academic features. The system implements a complete MLOps pipeline from data ingestion to model deployment with a user-friendly web interface.
- End-to-End ML Pipeline: Complete workflow from data ingestion to model deployment
- Multiple Algorithm Comparison: Tests 7 different regression algorithms with hyperparameter tuning
- Real-time Predictions: Flask web application for instant score predictions
- Automated Model Selection: Picks the best-performing model by R² score
- Data Preprocessing: Handles categorical encoding and feature scaling
- Modular Architecture: Well-structured codebase with separate components for each pipeline stage
student_performance/
├── artifacts/                      # Stored models and preprocessors
│   ├── model.pkl                   # Trained ML model
│   ├── preprocessor.pkl            # Data preprocessing pipeline
│   ├── train.csv                   # Training dataset
│   ├── test.csv                    # Testing dataset
│   └── data.csv                    # Raw dataset
├── notebook/
│   └── data/
│       └── stud.csv                # Original dataset
├── src/
│   ├── components/                 # Core ML pipeline components
│   │   ├── __init__.py
│   │   ├── data_ingestion.py       # Data loading and splitting
│   │   ├── data_transformation.py  # Data preprocessing
│   │   └── model_trainer.py        # Model training and selection
│   ├── pipeline/                   # Prediction pipeline
│   │   ├── __init__.py
│   │   └── predict_pipeline.py     # Inference pipeline
│   ├── __init__.py
│   ├── exception.py                # Custom exception handling
│   ├── logger.py                   # Logging configuration
│   └── utils.py                    # Utility functions
├── templates/                      # HTML templates for web app
│   ├── index.html                  # Homepage template
│   └── home.html                   # Prediction form template
├── app.py                          # Flask web application
├── requirements.txt                # Project dependencies
└── README.md                       # Project documentation
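Data ingestion (`src/components/data_ingestion.py`):
- Loads the student performance dataset
- Splits the data into training (80%) and testing (20%) sets
- Saves the raw, training, and testing datasets as CSV files in the `artifacts/` folder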
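
The 80/20 split described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `data_ingestion.py`: the helper name `split_dataset`, the column names, and the random seed are assumptions, and the tiny demo frame stands in for the real `stud.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_dataset(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Return (train_df, test_df); the default keeps 80% for training."""
    # The seed is an assumed value that only makes the split reproducible.
    train_df, test_df = train_test_split(df, test_size=test_size, random_state=seed)
    return train_df, test_df


# Stand-in data with assumed column names; the project would instead load
# notebook/data/stud.csv with pd.read_csv.
demo = pd.DataFrame({"reading_score": range(10), "math_score": range(10, 20)})
train_df, test_df = split_dataset(demo)
print(len(train_df), len(test_df))
```

In the real pipeline, each frame would then be written out with `DataFrame.to_csv(..., index=False)` so that later stages read the same split.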
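Data transformation (`src/components/data_transformation.py`):
- Encodes the categorical variables (gender, race/ethnicity, parental level of education, lunch, test preparation course)
- Scales features with StandardScaler
- Builds a single preprocessing pipeline so training and inference apply the same transformation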
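
One common way to combine categorical encoding and scaling in a single reusable object is scikit-learn's `ColumnTransformer`. The sketch below assumes one-hot encoding and snake_case column names, neither of which is confirmed by this README, and uses only a subset of the features to stay short.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column names, chosen for illustration only.
NUMERIC = ["reading_score", "writing_score"]
CATEGORICAL = ["gender", "lunch"]


def build_preprocessor() -> ColumnTransformer:
    """Scale numeric columns and one-hot encode categorical columns."""
    numeric_pipe = Pipeline([("scale", StandardScaler())])
    # handle_unknown="ignore" avoids errors on categories unseen during fitting.
    categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])
    return ColumnTransformer([
        ("num", numeric_pipe, NUMERIC),
        ("cat", categorical_pipe, CATEGORICAL),
    ])


# Illustrative rows, not taken from the real dataset.
demo = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "reading_score": [72, 90, 95, 57],
    "writing_score": [74, 88, 93, 44],
})
preprocessor = build_preprocessor()
features = preprocessor.fit_transform(demo)
print(features.shape)
```

Fitting the preprocessor on the training set only, then pickling it, lets the prediction pipeline call `transform` on new inputs with the same encoding and scaling.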
The system evaluates multiple regression algorithms:
- Random Forest Regressor
- Decision Tree Regressor
- Gradient Boosting Regressor
- Linear Regression
- XGBoost Regressor
- CatBoost Regressor
- AdaBoost Regressor
Each model is tuned with GridSearchCV, which searches a predefined parameter grid for the best-scoring combination.
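
A rough sketch of the tune-and-compare loop, shown with only two of the seven models and synthetic data so that it runs on its own. The parameter grids, the 3-fold setting, and the helper name `tune_and_score` are assumptions, not the project's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the preprocessed student features.
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each entry maps a model name to (estimator, parameter grid). Grids are assumed.
candidates = {
    "Linear Regression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "Random Forest Regressor": (
        RandomForestRegressor(random_state=0),
        {"n_estimators": [10, 30]},
    ),
}


def tune_and_score(candidates, X_train, y_train, X_test, y_test):
    """Grid-search each candidate, then report its R² on the test split."""
    report = {}
    for name, (model, grid) in candidates.items():
        search = GridSearchCV(model, grid, cv=3, scoring="r2")
        search.fit(X_train, y_train)
        predictions = search.best_estimator_.predict(X_test)
        report[name] = r2_score(y_test, predictions)
    return report


report = tune_and_score(candidates, X_train, y_train, X_test, y_test)
print(report)
```

Adding the remaining models would only require extra entries in `candidates`; the loop itself stays unchanged.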
- Automatically selects the best-performing model based on R² score
- Requires a minimum R² score of 0.6 for a model to be accepted
- Saves the best model to `artifacts/model.pkl` for production use
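
The selection rule above (highest R², rejected if below 0.6) could look something like this. The function name is hypothetical, and a plain `ValueError` stands in for the project's custom exception class, whose interface is not shown in this README. The scores in the usage line are placeholder numbers, not measured results.

```python
def select_best_model(report: dict, min_r2: float = 0.6):
    """Return (name, score) of the highest-R² model, or raise if none qualifies."""
    if not report:
        raise ValueError("No models were evaluated")
    best_name = max(report, key=report.get)
    best_score = report[best_name]
    if best_score < min_r2:
        # The README states a 0.6 minimum for a model to be accepted.
        raise ValueError(
            f"Best model {best_name!r} scored {best_score:.2f}, below {min_r2}"
        )
    return best_name, best_score


# Placeholder scores for illustration only.
print(select_best_model({"model_a": 0.50, "model_b": 0.70}))
```

Raising instead of silently saving a weak model keeps a poorly performing artifact from reaching the web application.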
Input features:
- Gender: Male/Female
- Race/Ethnicity: Student's ethnic background
- Parental Level of Education: Education level of parents
- Lunch: Standard or free/reduced lunch
- Test Preparation Course: Completed or not completed
- Reading Score: Student's reading test score
- Writing Score: Student's writing test score

Output:
- Math Score Prediction: Predicted math test score (0-100)
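
The fields listed above can be gathered into one record before being passed to the preprocessor and model. The sketch below is an assumption about how that might be structured: the class name, the snake_case field names, and the 0-100 range check on the input scores are all hypothetical, and the example values are illustrative.

```python
from dataclasses import asdict, dataclass


@dataclass
class StudentInput:
    """One prediction request; field names are assumed, not taken from the code."""
    gender: str
    race_ethnicity: str
    parental_level_of_education: str
    lunch: str
    test_preparation_course: str
    reading_score: float
    writing_score: float

    def to_record(self) -> dict:
        """Validate the scores and return a plain dict, one key per feature."""
        for name in ("reading_score", "writing_score"):
            value = getattr(self, name)
            # Assumed range: the README only states 0-100 for the math score.
            if not 0 <= value <= 100:
                raise ValueError(f"{name} must be between 0 and 100, got {value}")
        return asdict(self)


# Illustrative values only.
record = StudentInput(
    gender="female",
    race_ethnicity="group B",
    parental_level_of_education="bachelor's degree",
    lunch="standard",
    test_preparation_course="completed",
    reading_score=72,
    writing_score=74,
).to_record()
print(record)
```

A record like this can be wrapped in a one-row `pandas.DataFrame` so that the saved preprocessor sees the same column layout it was fitted on.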
- Python 3.8
- Scikit-learn: Machine learning algorithms and preprocessing
- XGBoost & CatBoost: Advanced boosting algorithms
- Flask: Web application framework
- Pandas & NumPy: Data manipulation and analysis
- HTML/CSS: Frontend interface
The system selects the model with the highest R² score on the held-out test data and uses it to predict student math scores.