Project Overview
Motivation:
This project was completed as part of my Introduction to Data Science course, where we created a tutorial that walks users through the entire data science pipeline: data curation, parsing, and management; exploratory data analysis; model building through hypothesis testing, machine learning, or both; and the curation of one or more messages covering the insights learned along the way. My project analyzes NBA statistics to predict the league's Most Valuable Player (MVP) and to understand which factors most influence the award.
Important Links:
Class Project Website: Final Tutorial Project Page
Final Portfolio (GitHub Pages): GitHub Pages Website
GitHub Repository: GitHub Repository
Project Idea and Goals:
For my project, I will be analyzing datasets containing various NBA statistics from many seasons. I hope that, through ETL, EDA, and Model Building, I will be able to predict who will be the NBA's Most Valuable Player, and perhaps determine what feature or set of features has the most weight in determining the MVP. I will be doing my coding in Python on Google Colaboratory and uploading it here on GitHub. Some of the libraries that will be used include Pandas, NumPy, SQL, Seaborn, and more.
Project Workflow:
- ETL (Extract, Transform, Load):
  - Gathered NBA statistics from publicly available datasets.
  - Cleaned and transformed the data to ensure consistency and accuracy.
  - Dealt with missing values and standardized metrics for analysis.
- Exploratory Data Analysis (EDA):
  - Visualized trends in player performance over multiple seasons.
  - Analyzed correlations between various player statistics and MVP winners.
  - Highlighted key patterns using Python libraries like Seaborn and Matplotlib.
- Model Building:
  - Experimented with machine learning models to predict the MVP.
  - Conducted feature selection to identify the most influential attributes.
  - Evaluated model performance using metrics like accuracy, precision, recall, and F1-score.
- Findings:
  - Rebounding metrics, scoring efficiency, and advanced stats like VORP (value over replacement player) were the most influential predictors.
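The cleaning step in the ETL stage can be illustrated with a minimal sketch. It rests on two assumptions that this write-up does not confirm: missing values are filled with the season median, and each metric is standardized as a z-score within its own season so that players from different seasons are comparable. The rows, player names, and the `standardize_by_season` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical rows: (season, player, points_per_game); None marks a missing value.
rows = [
    (2001, "Player A", 30.0),
    (2001, "Player B", 20.0),
    (2001, "Player C", None),
    (2002, "Player D", 28.0),
    (2002, "Player E", 22.0),
    (2002, "Player F", 25.0),
]


def standardize_by_season(rows):
    """Fill missing values with the season median, then z-score within each season.

    Assumes every season has at least one observed (non-missing) value.
    """
    # Collect the observed values per season to compute the fill value.
    observed = defaultdict(list)
    for season, _, value in rows:
        if value is not None:
            observed[season].append(value)
    medians = {season: median(values) for season, values in observed.items()}

    # Replace missing values with the median of that season.
    filled = [
        (season, player, medians[season] if value is None else value)
        for season, player, value in rows
    ]

    # Compute the per-season mean and population standard deviation.
    groups = defaultdict(list)
    for season, _, value in filled:
        groups[season].append(value)
    stats = {season: (mean(v), pstdev(v)) for season, v in groups.items()}

    # Convert each value to a z-score; a constant season maps to 0.0.
    standardized = []
    for season, player, value in filled:
        mu, sigma = stats[season]
        z = 0.0 if sigma == 0 else (value - mu) / sigma
        standardized.append((season, player, round(z, 3)))
    return standardized


standardized = standardize_by_season(rows)
for season, player, z in standardized:
    print(season, player, z)
```

Standardizing within a season rather than across the whole dataset is one possible design choice: it keeps a high-scoring season from inflating every player's value relative to other years. Whether the original notebook does this is not stated here.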
Project Review
Skills Gained:
- Data Wrangling: Cleaning and transforming complex datasets.
- Statistical Analysis: Identifying trends and drawing meaningful conclusions.
- Machine Learning: Building and evaluating predictive models.
- Visualization: Creating compelling graphics to communicate insights.
- Technical Communication: Documenting and presenting findings effectively.
Reflection:
Completing this project reinforced the importance of data quality and thoughtful feature engineering. It also highlighted the challenges of building models that generalize well. I now better appreciate the iterative nature of data science projects and the importance of balancing technical precision with storytelling when communicating findings.
Gallery:
Model | Accuracy | ROC AUC | Precision (Non-MVP) | Recall (Non-MVP) | F1 (Non-MVP) | Precision (MVP) | Recall (MVP) | F1 (MVP) |
Base Model | 97.97% | 0.964811 | 0.98 | 1.00 | 0.99 | 0.56 | 0.26 | 0.36 |
Feature-Engineered Model | 98.09% | 0.956272 | 0.98 | 1.00 | 0.99 | 0.62 | 0.26 | 0.37 |
Tuned Model | 94.03% | 0.994489 | 1.00 | 0.94 | 0.97 | 0.26 | 0.95 | 0.40 |
New Model | 98.99% | 0.993519 | 0.99 | 1.00 | 0.99 | 0.86 | 0.63 | 0.73 |
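As a reading aid for the table, the sketch below recomputes the MVP-class F1 score from the reported precision and recall, using the standard definition of F1 as the harmonic mean of the two. The numbers are copied from the table; the `f1_score` helper is written here for illustration and is not taken from the project code. Because the table's precision and recall are already rounded to two decimals, the recomputed values can differ from the reported F1 by about 0.01.

```python
# MVP-class (precision, recall, reported F1), copied from the table above.
reported = {
    "Base Model": (0.56, 0.26, 0.36),
    "Feature-Engineered Model": (0.62, 0.26, 0.37),
    "Tuned Model": (0.26, 0.95, 0.40),
    "New Model": (0.86, 0.63, 0.73),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute F1 for each model from its rounded precision and recall.
recomputed = {
    name: f1_score(precision, recall)
    for name, (precision, recall, _) in reported.items()
}

for name, (_, _, f1) in reported.items():
    print(f"{name}: reported F1 = {f1:.2f}, recomputed = {recomputed[name]:.3f}")
```

The comparison shows why accuracy alone says little here: the Tuned Model has the lowest accuracy (94.03%) but the highest MVP recall (0.95), while the Base Model reaches 97.97% accuracy with an MVP recall of only 0.26. The New Model has the highest MVP F1 (0.73) of the four.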