This project analyzes a dataset of English Premier League (EPL) matches using statistical analysis and machine learning models. The analysis covers team formations, home advantage, shooting efficiency, and match outcomes. The ultimate aim is to support strategic decision-making and contribute to the growing field of football analytics.
- 📊 Raw Data: Contains the original data web-scraped from the fbref website and the Stathead platform.
- 🧹 Processed Data: Includes cleaned and feature-engineered datasets used for analysis and model training.
- Includes detailed project documentation, the final report, and research notes explaining methods and findings.
- Python notebooks implementing machine learning models like Logistic Regression, Random Forest, Gradient Boosting, and Neural Networks. Each notebook provides performance metrics and comparative analysis.
- 🛠️ Data Collection: Scripts using web scraping tools like BeautifulSoup to extract data from fbref and handle API interactions with Stathead.
- 📈 EDA: Notebooks for exploratory data analysis, including visualizations and initial insights into the dataset.
- 🔍 Hypothesis Testing: Code to validate hypotheses such as the impact of formations on win rates or the statistical significance of home advantage.
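
As a rough illustration of the data collection step, the sketch below parses a match table from HTML with BeautifulSoup and shows a throttled fetch loop. It is not the project's actual script: the function names, the sample columns (`Date`, `Venue`, `Result`, `GA`), and the six-second delay are all assumptions made for the example.

```python
import time

from bs4 import BeautifulSoup


def parse_match_table(html: str) -> list[dict[str, str]]:
    """Turn the first HTML <table> into a list of {column: value} rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    all_rows = table.find_all("tr")
    # The first row is assumed to hold the column headers.
    headers = [th.get_text(strip=True) for th in all_rows[0].find_all("th")]
    rows = []
    for tr in all_rows[1:]:
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if len(cells) == len(headers):  # skip spacer or malformed rows
            rows.append(dict(zip(headers, cells)))
    return rows


def fetch_pages(urls, delay_seconds=6.0):
    """Fetch pages one at a time, pausing between requests (delay is an assumed value)."""
    import requests  # imported lazily so the parser above works offline

    pages = []
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(delay_seconds)
    return pages


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Venue</th><th>Result</th><th>GF</th><th>GA</th></tr>
  <tr><td>2023-08-12</td><td>Home</td><td>W</td><td>2</td><td>1</td></tr>
  <tr><td>2023-08-19</td><td>Away</td><td>D</td><td>1</td><td>1</td></tr>
</table>
"""
rows = parse_match_table(SAMPLE_HTML)
print(rows)
```

Keeping parsing separate from fetching means the parser can be checked against saved pages without sending any requests.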
- Data was collected using web scraping techniques from the fbref website. Initial scraping was limited by rate limits, prompting the use of a paid subscription to Stathead for extended access.
- The dataset comprises over 5,000 matches, covering seven EPL seasons, with over 27 columns of data, including team stats, results, and match metadata.
- Features such as `Avg_Goals_Scored_3`, `Goal_Efficiency`, and `Shooting_Accuracy` were engineered for deeper analysis.
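
The README does not define the engineered features, so the following pandas sketch shows one plausible reading rather than the project's implementation. It assumes `Avg_Goals_Scored_3` is a team's mean goals over its previous three matches, `Goal_Efficiency` is goals per shot, and `Shooting_Accuracy` is shots on target per shot. The input column names (`Team`, `Date`, `Sh`, `SoT`) and the sample values are also assumptions.

```python
import pandas as pd

# Hypothetical per-team match rows; values are made up for illustration.
matches = pd.DataFrame({
    "Team": ["A", "A", "A", "A", "B", "B"],
    "Date": pd.to_datetime([
        "2023-08-12", "2023-08-19", "2023-08-26", "2023-09-02",
        "2023-08-12", "2023-08-19",
    ]),
    "GF": [1, 2, 3, 0],
    "Sh": [10, 8, 12, 0],
    "SoT": [4, 4, 6, 0],
} | {"GF": [1, 2, 3, 0, 2, 0], "Sh": [10, 8, 12, 0, 5, 10], "SoT": [4, 4, 6, 0, 2, 3]})


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add the three engineered columns under the assumed definitions."""
    out = df.sort_values(["Team", "Date"]).copy()
    # Rolling mean of the previous three matches; shift(1) excludes the
    # current match so the feature does not leak the result being predicted.
    out["Avg_Goals_Scored_3"] = out.groupby("Team")["GF"].transform(
        lambda s: s.shift(1).rolling(3, min_periods=1).mean()
    )
    # Matches with zero shots yield NaN instead of a division by zero.
    shots = out["Sh"].where(out["Sh"] > 0)
    out["Goal_Efficiency"] = out["GF"] / shots
    out["Shooting_Accuracy"] = out["SoT"] / shots
    return out


featured = add_features(matches)
print(featured)
```

Each team's first match has no history, so its rolling average is left as NaN; whether to drop or impute such rows is a separate modelling decision.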
- Gradient Boosting: The top-performing model, achieving the highest accuracy and effectively capturing complex relationships.
- Random Forest: A close second, offering robust performance and feature importance insights.
- Neural Networks: Demonstrated potential but required more computational resources and fine-tuning.
- Other models included Logistic Regression, SVM, Naive Bayes, and Decision Trees, each offering unique strengths and limitations.
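
A model comparison of this kind can be sketched with scikit-learn as below. Synthetic three-class data stands in for the match features (win, draw, loss), and the hyperparameters are library defaults or arbitrary choices, so the scores it prints say nothing about the project's reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed dataset: 3 classes ~ win / draw / loss.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

models = {
    # Scaling helps the linear model converge; tree ensembles do not need it.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {accuracy:.3f}")
```

Cross-validation is used here instead of a single train/test split so that the ranking depends less on one particular partition of the data.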
- Hypotheses tested included:
- Teams with higher shooting accuracy have higher win rates.
- Home advantage plays a statistically significant role in match outcomes.
- Tools like t-tests and binomial tests were used for hypothesis validation.
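
To illustrate the home advantage hypothesis, here is a minimal one-sided exact binomial test using only the standard library. It assumes draws are excluded and that, with no home advantage, a decisive match is equally likely to go either way (null probability 0.5). The win counts are hypothetical, not taken from the dataset.

```python
from math import comb


def binomial_test_greater(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided exact binomial test: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical counts of decisive matches (draws excluded).
home_wins, away_wins = 60, 40
p_value = binomial_test_greater(home_wins, home_wins + away_wins)

print(f"p-value = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null: home teams win more often than chance would suggest.")
else:
    print("Insufficient evidence of a home advantage at the 5% level.")
```

This direct summation is fine for a few hundred matches; for thousands of matches the floating-point terms can overflow or underflow, and a library routine such as `scipy.stats.binomtest` is the safer choice.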
- 📡 Radar charts to compare performance metrics across top teams.
- 🗺️ Heatmaps showing correlations between match statistics.
- 🟠 Scatter plots highlighting relationships like xG (expected goals) vs. GF (goals for).
- 📊 Bar charts depicting winning percentages and other key trends.
- Data Collection: BeautifulSoup, Requests
- Data Analysis: Pandas, NumPy
- Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn, TensorFlow
- Documentation: Jupyter Notebooks, Markdown
- Gradient Boosting emerged as the top-performing model, with an accuracy of 61.46%, highlighting its ability to model complex relationships.
- Hypothesis tests confirmed significant factors, including the critical role of shooting efficiency and the value of certain formations.
- Visualizations provided actionable insights into team strategies and match outcomes.
- 📝 Account Setup:
  - Create an account on Stathead for extended data access if you plan to replicate the data collection process.
  - Use the scripts in `Scripts/Data Collection` to scrape and preprocess data.
- 📉 Run Analysis:
  - Notebooks in the `Scripts/EDA` folder provide initial insights and visualizations.
  - Use hypothesis testing scripts for statistical validations.
- 🤖 Train Models:
  - The `Models/` folder contains training scripts for various machine learning models.
  - Modify parameters as needed to explore different configurations.
- 📑 Review Results:
  - Refer to the `Docs/` folder for a comprehensive project report and summarized findings.
- Special thanks to fbref and Stathead for providing access to EPL match data.
- Appreciation to the University of Illinois at Chicago (UIC) for resources and guidance in CS418.