Commit 7a2b14f

1 parent 39f5e1d commit 7a2b14f

1 file changed: +181 -43 lines changed

README.md

Lines changed: 181 additions & 43 deletions
@@ -1,43 +1,181 @@
1-
# Titanic Survival Prediction
2-
3-
**Written by Lily Gates**
4-
*May 2025*
5-
6-
## Description
7-
This project predicts the survival outcomes of passengers aboard the Titanic using machine learning. It investigates the influence of various factors such as age, gender, passenger class, and fare on survival chances. The analysis utilizes a Random Forest classifier, which outperforms other models like Logistic Regression and Decision Tree in predicting survival based on historical data.
8-
9-
## Methodology
10-
The analysis uses supervised learning, employing three classification models: Logistic Regression, Decision Trees, and Random Forest.
11-
12-
The methodology includes:
13-
* **Data Preprocessing**: Handling missing values, encoding categorical variables, and scaling numerical features.
14-
* **Model Training**: The models are trained on the dataset, which is split into training and test sets.
15-
* **Model Evaluation**: The models are evaluated using performance metrics like accuracy, precision, recall, and F1-score. A confusion matrix is also used to assess model performance.
16-
* **Feature Importance**: The models rank features based on their contribution to the survival prediction.
17-
18-
## Required Dependencies
19-
To run the project, the following Python libraries are required:
20-
* `pandas`
21-
* `numpy`
22-
* `scikit-learn`
23-
* `matplotlib`
24-
* `seaborn`
25-
26-
## Output
27-
The script generates:
28-
* **Feature Importance Plots**: Visualizations showing the most influential factors in predicting survival (e.g., age, gender, fare).
29-
* **Confusion Matrix**: For each model, visualizing true positives, false positives, true negatives, and false negatives.
30-
* **Model Performance Metrics**: Including accuracy, precision, recall, and F1-score for each model.
31-
32-
## Limitations
33-
Despite the Random Forest model outperforming the other models in key metrics, there are several limitations:
34-
1. **Limited Feature Set**: The model was trained on a limited set of features, excluding potentially important variables like cabin location, family identifiers, or group ticket information. This simplification may have overlooked crucial survival patterns.
35-
2. **Overfitting and Bias**: Random Forest models can still overfit, particularly when individual trees are grown deep on a small dataset. Impurity-based feature importances also tend to favor features with many distinct values, such as fare, over low-cardinality features such as class or sex, so the rankings may understate more nuanced factors.
36-
3. **Contextual Factors**: The analysis does not include critical contextual factors such as proximity to lifeboats, crew behavior, or personal connections, all of which were likely influential during the Titanic disaster.
37-
4. **Generalizability**: The model was validated using a holdout portion of the same dataset, so its performance on unseen data or in different scenarios remains untested.
38-
39-
## Future Improvements
40-
* Experiment with different scaling methods for numerical features.
41-
* Test different tree depth levels for the Decision Tree model to avoid overfitting.
42-
* Explore alternative methods for addressing missing "Age" values.
43-
* Use the Kaggle `test.csv` file to compare the performance of the Random Forest model trained on `train.csv`.
1+
# Titanic Machine Learning 🛳️💻
2+
3+
Welcome to the Titanic Machine Learning repository! This project predicts the survival of Titanic passengers using various machine learning techniques. It explores key factors such as age, gender, and fare to identify what influences survival rates.
4+
5+
[![Releases](https://img.shields.io/github/release/gamy703/titanic_machine_learning.svg)](https://github.com/gamy703/titanic_machine_learning/releases)
6+
7+
## Table of Contents
8+
9+
- [Introduction](#introduction)
10+
- [Features](#features)
11+
- [Technologies Used](#technologies-used)
12+
- [Getting Started](#getting-started)
13+
- [Data Exploration](#data-exploration)
14+
- [Machine Learning Models](#machine-learning-models)
15+
- [Results](#results)
16+
- [Contributing](#contributing)
17+
- [License](#license)
18+
- [Contact](#contact)
19+
20+
## Introduction
21+
22+
The Titanic disaster remains one of the most discussed maritime tragedies. In this project, we aim to analyze the Titanic dataset to predict passenger survival. By applying machine learning algorithms, we can identify which factors played a significant role in survival. This project uses Logistic Regression, Decision Trees, and Random Forest algorithms to perform classification.
23+
24+
## Features
25+
26+
- Predicts passenger survival based on various features.
27+
- Utilizes Logistic Regression, Decision Tree, and Random Forest algorithms.
28+
- Analyzes key factors like age, gender, and fare.
29+
- Visualizes data for better understanding.
30+
- Easy to use and modify.
31+
32+
## Technologies Used
33+
34+
This project employs several technologies and libraries:
35+
36+
- **Python**: The primary programming language.
37+
- **Pandas**: For data manipulation and analysis.
38+
- **NumPy**: For numerical computations.
39+
- **Matplotlib**: For data visualization.
40+
- **Seaborn**: For statistical data visualization.
41+
- **Scikit-learn**: For implementing machine learning algorithms.
42+
43+
## Getting Started
44+
45+
To get started with this project, follow these steps:
46+
47+
1. **Clone the Repository**:
48+
```bash
git clone https://github.com/gamy703/titanic_machine_learning.git
cd titanic_machine_learning
```
52+
53+
2. **Install Required Libraries**:
54+
Ensure you have Python installed, then run:
55+
```bash
pip install -r requirements.txt
```
58+
59+
3. **Download the Dataset**:
60+
You can find the Titanic dataset on [Kaggle](https://www.kaggle.com/c/titanic/data). Download the dataset and place it in the project directory.
61+
62+
4. **Run the Project**:
63+
Execute the main script to see the predictions:
64+
```bash
python main.py
```
67+
68+
5. **Check Releases**:
69+
For the latest updates and releases, visit [Releases](https://github.com/gamy703/titanic_machine_learning/releases).
70+
71+
## Data Exploration
72+
73+
Before diving into machine learning, it’s crucial to explore the dataset. The Titanic dataset contains various features that can influence survival:
74+
75+
- **PassengerId**: Unique identifier for each passenger.
76+
- **Survived**: Survival status (0 = No, 1 = Yes).
77+
- **Pclass**: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
78+
- **Name**: Passenger name.
79+
- **Sex**: Gender of the passenger.
80+
- **Age**: Age in years.
81+
- **SibSp**: Number of siblings or spouses aboard.
82+
- **Parch**: Number of parents or children aboard.
83+
- **Ticket**: Ticket number.
84+
- **Fare**: Fare paid for the ticket.
85+
- **Cabin**: Cabin number.
86+
- **Embarked**: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
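A first pass over these columns usually checks for missing values and converts the text columns to numbers. The following is a minimal sketch of that step, not the project's confirmed pipeline: the four rows are made up and only mimic the Kaggle column layout, and the imputation choices (median `Age`, most frequent `Embarked`) are assumptions.

```python
import pandas as pd

# Hypothetical rows that only mimic the Kaggle column layout (not real passenger data).
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
    "Embarked": ["S", "C", None, "S"],
})

# Count missing values per column before deciding how to handle them.
missing = data.isna().sum()
print(missing[missing > 0])

# Assumed strategy: median for Age, most frequent port for Embarked.
data["Age"] = data["Age"].fillna(data["Age"].median())
data["Embarked"] = data["Embarked"].fillna(data["Embarked"].mode()[0])

# Encode the categorical columns as numeric indicator columns.
X = pd.get_dummies(
    data.drop(columns="Survived"),
    columns=["Sex", "Embarked"],
    drop_first=True,
)
y = data["Survived"]
print(X.columns.tolist())
```

With the real dataset, the `DataFrame` literal would be replaced by a `pd.read_csv` call on the downloaded training file.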
87+
88+
### Visualizing Data
89+
90+
We utilize Matplotlib and Seaborn to visualize relationships between different features and survival rates. Some key visualizations include:
91+
92+
- **Survival by Gender**: Understanding how gender affects survival rates.
93+
- **Age Distribution**: Analyzing age groups and their survival rates.
94+
- **Fare Distribution**: Exploring how fare correlates with survival.
95+
96+
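```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumes the Kaggle training file was saved as train.csv in the project directory
data = pd.read_csv('train.csv')

# Example visualization: survivors and non-survivors, split by sex
sns.countplot(x='Survived', hue='Sex', data=data)
plt.title('Survival Count by Gender')
plt.show()
```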
105+
106+
## Machine Learning Models
107+
108+
This project implements three primary machine learning models:
109+
110+
### 1. Logistic Regression
111+
112+
Logistic Regression is a statistical method for predicting binary classes. It estimates the probability that a given input point belongs to a certain class.
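As a small illustration of that probability estimate, the sketch below fits scikit-learn's `LogisticRegression` on made-up one-feature data (not the Titanic data) and checks that `predict_proba` matches the sigmoid of the fitted linear score. The variable names are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, made-up data: one feature, binary label (not Titanic data).
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

# predict_proba returns one column per class; column 1 is P(y = 1 | x).
query = np.array([[0.5], [4.5]])
proba = clf.predict_proba(query)[:, 1]

# The same probability computed by hand: sigmoid(intercept + coef * x).
z = clf.intercept_[0] + clf.coef_[0, 0] * query[:, 0]
manual = 1.0 / (1.0 + np.exp(-z))

print(proba)
print(manual)
```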
113+
114+
### 2. Decision Tree
115+
116+
A Decision Tree uses a tree-like model to make decisions based on feature values. It splits the data into subsets based on the value of features.
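To make the splitting concrete, here is a minimal sketch on made-up rows (two illustrative features, `Pclass` and a hypothetical `is_female` flag). The labels are constructed so that a single split on `is_female` separates the classes; `export_text` prints the learned rules.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up rows: [Pclass, is_female]; labels are illustrative only.
X_toy = [[1, 1], [2, 1], [3, 1], [1, 0], [2, 0], [3, 0]]
y_toy = [1, 1, 1, 0, 0, 0]

# max_depth caps how many successive splits the tree may make.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_toy, y_toy)

# Print the learned if/else rules in plain text.
print(export_text(tree, feature_names=["Pclass", "is_female"]))
```

Limiting `max_depth` is one common way to keep a tree from memorizing the training rows.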
117+
118+
### 3. Random Forest
119+
120+
Random Forest is an ensemble learning method that trains many decision trees on random subsets of the rows and features, then combines their predictions to improve accuracy and reduce overfitting.
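The sketch below shows how the individual trees are combined in scikit-learn's `RandomForestClassifier`, which averages the per-tree class probabilities. It uses synthetic data from `make_classification` as a stand-in, so the numbers say nothing about the Titanic results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 200 rows, 5 numeric features.
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_syn, y_syn)

# Each fitted tree is stored in estimators_.
per_tree = np.stack([t.predict_proba(X_syn[:3]) for t in forest.estimators_])

# Averaging the per-tree probabilities reproduces the forest's own output.
averaged = per_tree.mean(axis=0)
print(np.allclose(averaged, forest.predict_proba(X_syn[:3])))
```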
121+
122+
### Model Evaluation
123+
124+
Each model is evaluated using accuracy, precision, recall, and F1-score. Scoring on data held out from training, either with a single train/test split as in the example here or with k-fold cross-validation, estimates how well a model generalizes to unseen data.
125+
126+
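```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Example code for model evaluation.
# X (feature matrix) and y (Survived labels) are assumed to be prepared already.
model = RandomForestClassifier(random_state=42)  # any of the three models works here

# Hold out 20% of the rows; a fixed random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f'Model Accuracy: {accuracy:.3f}')
```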
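The hold-out example above reports accuracy only. A minimal sketch of the other metrics and of k-fold cross-validation follows; it runs on synthetic data from `make_classification` as a stand-in for the prepared Titanic features, and the fold count and model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the prepared Titanic features (assumption).
X, y = make_classification(n_samples=300, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=50, random_state=42)

# 5-fold cross-validation: five accuracy scores, one per held-out fold.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Precision, recall, and F1 per class on a stratified hold-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
model.fit(X_train, y_train)
report = classification_report(y_test, model.predict(X_test))
print(report)
```

Reporting the spread across folds alongside the mean gives a rough sense of how sensitive the score is to the particular split.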
137+
138+
## Results
139+
140+
After training and evaluating the models, we compare their performance. The Random Forest model often yields the best accuracy, followed by Decision Trees and Logistic Regression.
141+
142+
### Feature Importance
143+
144+
Feature importances show which inputs a model relies on most when predicting survival. For a fitted tree-based model (Decision Tree or Random Forest), they can be visualized using:
145+
146+
```python
import numpy as np
import matplotlib.pyplot as plt

# Assumes `model` is a fitted tree-based estimator and `X` is the feature DataFrame
importances = model.feature_importances_
feature_names = X.columns
indices = np.argsort(importances)[::-1]  # most important first

plt.figure()
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), feature_names[indices], rotation=90)
plt.xlim([-1, X.shape[1]])
plt.tight_layout()
plt.show()
```
158+
159+
## Contributing
160+
161+
Contributions are welcome! If you want to contribute to this project, please follow these steps:
162+
163+
1. Fork the repository.
164+
2. Create a new branch (`git checkout -b feature-branch`).
165+
3. Make your changes.
166+
4. Commit your changes (`git commit -m 'Add new feature'`).
167+
5. Push to the branch (`git push origin feature-branch`).
168+
6. Create a pull request.
169+
170+
## License
171+
172+
This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
173+
174+
## Contact
175+
176+
For questions or suggestions, feel free to reach out:
177+
178+
- GitHub: [gamy703](https://github.com/gamy703)
179+
- Email: gamy703@example.com
180+
181+
Explore the Titanic dataset and enhance your machine learning skills! For updates, check the [Releases](https://github.com/gamy703/titanic_machine_learning/releases).
