This repository contains a machine learning project built to predict survival outcomes for passengers on the Titanic. Using passenger attributes like age, class, family size, and fare, this model achieves 78% accuracy on the test dataset and 83% on the training dataset.
I’m diving into the legendary Titanic Machine Learning competition on Kaggle—a rite of passage for data scientists everywhere. Just like the ‘unsinkable’ ship, I’m hoping to stay afloat as I predict who survives! Tackling this dataset feels like an initiation into the world of data science, where feature engineering and model selection are my life vests. 🚢🛟
- `notebook/`: Directory for the main Jupyter notebook.
  - `Titanic_ML.ipynb`: The main notebook, containing the complete project workflow: EDA, data preprocessing, feature engineering, model building, hyperparameter tuning, and evaluation.
- `model/`: Directory for storing the trained model.
  - `optimized_stacking_ensemble_model.joblib`: The saved trained model.
- `README.md`: Project overview and usage instructions.
- Analyzed distributions of key features, survival rates by class and age, and correlations between variables to guide feature engineering.
- Feature Engineering: Created new features such as `Family_Size` and `Is_Alone`, and extracted `Title` from passenger names.
- Handling Missing Values: Imputed missing values using the mean or median for continuous features and the mode for categorical features.
- Encoding: Applied one-hot encoding to categorical features.
- Scaling: Used `StandardScaler` to standardize numerical features. (The full preprocessing flow is sketched after this list.)
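A minimal sketch of these preprocessing steps, assuming `df` is the raw Kaggle Titanic DataFrame with the standard column names; the notebook's exact imputation and encoding choices may differ:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Feature engineering: family size, alone flag, and title from the name.
    df['Family_Size'] = df['SibSp'] + df['Parch'] + 1
    df['Is_Alone'] = (df['Family_Size'] == 1).astype(int)
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

    # Impute missing values: median for continuous features, mode for categorical ones.
    df['Age'] = df['Age'].fillna(df['Age'].median())
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # One-hot encode categorical features.
    df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'Title'], drop_first=True)

    # Standardize numerical features.
    num_cols = ['Age', 'Fare', 'Family_Size']
    df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    return df
```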
- Models Used: Random Forest and Gradient Boosting were chosen as primary models due to their strong performance on tabular data.
- Bagging and Stacking Ensemble: Applied `BaggingClassifier` to enhance the generalization of Gradient Boosting, and combined the base models in a stacking ensemble with Logistic Regression as the meta-model (sketched after this list).
- Hyperparameter Tuning: `GridSearchCV` was used for comprehensive tuning of Random Forest and Gradient Boosting, while `RandomizedSearchCV` was used for faster tuning.
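An illustrative sketch of this ensemble setup; the hyperparameter values here are assumptions, not the notebook's tuned settings, and `X_train`/`y_train` are the preprocessed training data from the step above:

```python
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

# Bagging around Gradient Boosting to improve generalization.
bagged_gb = BaggingClassifier(
    estimator=GradientBoostingClassifier(random_state=42),  # `estimator` kwarg: scikit-learn >= 1.2
    n_estimators=10,
    random_state=42,
)

# Stacking ensemble with Logistic Regression as the meta-model.
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('bagged_gb', bagged_gb),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)

stack.fit(X_train, y_train)  # X_train, y_train: preprocessed training data (assumed)
```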
- Train Accuracy: 83%
- Test Accuracy: 78%
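For reference, these figures would typically be computed as follows, assuming the fitted `stack` model above and a labeled hold-out split `X_test`/`y_test`:

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, stack.predict(X_train))
test_acc = accuracy_score(y_test, stack.predict(X_test))
print(f"Train accuracy: {train_acc:.0%}, Test accuracy: {test_acc:.0%}")
```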
The following optimizations reduced execution time; an illustrative search call follows the list:
- Used a focused hyperparameter grid for key models to limit search space.
- Set `n_jobs=-1` to utilize all CPU cores.
- Reduced cross-validation folds from 5 to 3 for `RandomizedSearchCV`.
- Limited tuning primarily to Random Forest and Gradient Boosting, the models with the greatest impact.
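A hedged sketch of how these settings combine in a single search call; the grid values are illustrative, not the notebook's actual search space:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# A focused grid keeps the search space small.
param_dist = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 4, 5],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    cv=3,        # folds reduced from 5 to 3
    n_jobs=-1,   # use all CPU cores
    random_state=42,
)
search.fit(X_train, y_train)  # assumed preprocessed training data
print(search.best_params_)
```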
List of required libraries:

```
# Data Manipulation
pandas
numpy

# Visualization
matplotlib
seaborn

# Machine Learning Models and Tools
scikit-learn
joblib
```
- Clone the repository and navigate to the project:

  ```bash
  git clone https://github.com/your-username/Titanic_Survival_Prediction.git
  cd Titanic_Survival_Prediction
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the notebook: Open `Titanic_ML.ipynb` in Jupyter Notebook and execute the cells to preprocess the data, train, and evaluate the model.

  ```bash
  jupyter notebook Titanic_ML.ipynb
  ```

- Use the trained model for predictions:

  ```python
  import joblib

  # Load the saved stacking ensemble and predict on the preprocessed test set.
  model = joblib.load('model/optimized_stacking_ensemble_model.joblib')
  predictions = model.predict(test_clean)  # test_clean: preprocessed test features
  ```
The model achieves 78% accuracy on the test dataset and 83% on the training dataset. The small gap between the two suggests the model generalizes well with only mild overfitting, leaving room for further improvement.
Contributions are welcome! Feel free to open a pull request or submit issues for bugs or enhancement requests.
This project is open-source and available under the MIT License.