NBA Player Stats Forecast


1. Project Background

As NBA games grow more intense and attract more global attention, players' seasonal performance data has become an important basis for fans, analysts, and team decision-makers evaluating player ability and game trends. Using per-game NBA player statistics collected across multiple seasons, this project analyzes the data to reveal patterns and trends in player performance, explore the characteristics of different player types, and provide data support for roster decisions, tactical deployment, and predictions for future seasons.
 

2. Data Description

| Field | Description |
| --- | --- |
| URL | Player statistics page URL |
| player_name | Player name |
| player_games_played | Number of games played |
| player_games_started | Number of games started |
| player_minutes_per_game | Average minutes played per game |
| player_points_per_game | Average points scored per game |
| player_offensive_rebounds_per_game | Average offensive rebounds per game |
| player_defensive_rebounds_per_game | Average defensive rebounds per game |
| player_rebounds_per_game | Average total rebounds per game |
| player_assists_per_game | Average assists per game |
| player_steals_per_game | Average steals per game |
| player_blocks_per_game | Average blocks per game |
| player_turnovers_per_game | Average turnovers per game |
| player_fouls_per_game | Average fouls per game |
| player_assist_to_turnover_ratio | Assist to turnover ratio |
| team | Team name |
| season_type | Season type (regular season, playoffs, etc.) |
| season_year | Season year |
| timestamp | Data timestamp |

3. Python Library Import and Data Reading
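
A minimal sketch of this step, assuming the data is a CSV with the columns described in Section 2. The real file name is not given in this post, so the example reads a one-row in-memory sample instead; the row's values are placeholders, not real statistics.

```python
import io

import pandas as pd

# Placeholder sample standing in for the real file (name and path unknown).
# Only a subset of the columns from Section 2 is included to keep it short.
sample_csv = io.StringIO(
    "URL,player_name,player_games_played,player_points_per_game,"
    "team,season_type,season_year,timestamp\n"
    "https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY,"
    "█████ █████,10,5.0,NY,Not clear,2008-09,45606\n"
)

# With the real data this would be pd.read_csv("<path to the CSV file>").
df = pd.read_csv(sample_csv)

print(df.shape)   # rows and columns
print(df.dtypes)  # column types as inferred by pandas
```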

 
 

4. Data Preview and Preprocessing

player_name contains duplicate values because the same player appears in different seasons. The names were originally desensitized, with characters replaced by "█". That is not a problem, since the real names can be recovered from the URLs.
 
However, the desensitization is incomplete. Some names can still be deduced from the unmasked characters, and every URL contains the player's name, for example: https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY.
This URL clearly belongs to Chris Duhon, so we can extract the real player names from the URLs.
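
One way to do this extraction is sketched below. It assumes every URL follows the same pattern as the example, with the name as the last path segment; the helper name_from_url is hypothetical and not part of any library.

```python
from urllib.parse import urlparse


def name_from_url(url: str) -> str:
    """Return the player name encoded in the last path segment of the URL."""
    # urlparse separates the path from the query string (?year=...&team=...).
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    # "chris-duhon" -> "Chris Duhon"
    return " ".join(part.capitalize() for part in slug.split("-"))


url = "https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY"
print(name_from_url(url))  # Chris Duhon
```

Applied to the whole table, this would be something like df["player_name"] = df["URL"].map(name_from_url). Note that names that are themselves hyphenated come back with spaces instead of hyphens, because the URL does not distinguish the two.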
OK, the processing is complete and the real player names are now displayed. Next, let's check whether any features have so few distinct values that they carry no useful information.
 
In the season_type column, 80.9% of the values are "Not clear", so most rows have no usable season type. The timestamp column carries no meaningful information either, with the value 45606 accounting for a large share of the rows. We therefore drop these two features along with the URL.
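
A short sketch of this check, using a made-up five-row frame rather than the real data. value_counts(normalize=True) gives the share of each value, which is how a figure like 80.9% can be read off.

```python
import pandas as pd

# Made-up rows; only the column names match the dataset.
df = pd.DataFrame({
    "URL": ["u1", "u2", "u3", "u4", "u5"],
    "season_type": ["Not clear"] * 4 + ["Regular Season"],
    "timestamp": [45606, 45606, 45606, 45606, 45607],
    "player_points_per_game": [10.0, 5.0, 8.0, 12.0, 7.0],
})

# Share of each distinct value in the low-information columns.
share = df["season_type"].value_counts(normalize=True)
print(share.round(3))
print(df["timestamp"].value_counts(normalize=True).round(3))

# Drop the uninformative columns together with the URL.
df = df.drop(columns=["URL", "season_type", "timestamp"])
print(list(df.columns))
```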
season_year also needs processing. At first I read it as year-month, but it is actually a start year and an end year: 2004-05 means the season that began in 2004. Let's check whether any season spans more than one year.
No such cases exist, so we keep only the starting year.
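
The span check and conversion can be sketched as follows, assuming every value has the form YYYY-YY. The three sample seasons are illustrative.

```python
import pandas as pd

seasons = pd.Series(["2004-05", "1999-00", "2023-24"], name="season_year")

# Split "2004-05" into the four-digit start and the two-digit end.
parts = seasons.str.split("-", expand=True)
start = parts[0].astype(int)
end_two_digit = parts[1].astype(int)

# A one-year span means the two-digit end equals (start + 1) mod 100,
# which also covers the 1999-00 century rollover.
one_year_span = end_two_digit == (start + 1) % 100
print(one_year_span.all())

# Keep only the starting year.
season_start = start.rename("season_year")
print(season_start.tolist())
```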
Duplicate rows need to be removed. Although each row originally had its own URL, many rows are identical after the processing above, which means they describe the same player in the same season with identical metrics.
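
In pandas this is a single call. The sketch below uses made-up rows to show that only fully identical rows are dropped, while the same player in a different season is kept.

```python
import pandas as pd

# Made-up rows: the first two are exact duplicates.
df = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player A", "Player B"],
    "season_year": [2008, 2008, 2009, 2008],
    "player_points_per_game": [10.0, 10.0, 12.0, 7.5],
})

n_dupes = int(df.duplicated().sum())          # rows identical to an earlier row
deduped = df.drop_duplicates().reset_index(drop=True)

print(n_dupes, len(deduped))
```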
 
Some extreme values are expected because a few players are genuinely outstanding, and the data comes from espn.com, a globally known sports news and information website, so we leave the outliers untouched for now. Now we can begin the analysis.
 

5. Descriptive Analysis

Overall, a player's points per game shows minimal correlation with assist-to-turnover ratio and average blocks per game.

6. Cluster Analysis

 
Either 3 or 5 clusters is a reasonable choice here: the elbow curve flattens around 3 and 5, and both values have relatively high silhouette scores. To decide between them, let's reduce the data with PCA and compare the two solutions.
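
The elbow and silhouette figures behind this choice can be computed as sketched below. The data here is synthetic (make_blobs) and only stands in for the standardized per-game statistics; the range of k and the KMeans settings are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric player features.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X_raw)  # K-Means is scale-sensitive

results = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares used for the elbow plot.
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, sil) in results.items():
    print(f"k={k}  inertia={inertia:.1f}  silhouette={sil:.3f}")
```

Candidate values of k are those where the inertia curve stops dropping sharply and the silhouette score is comparatively high.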
It seems that with 5 clusters, the boundaries between clusters 2 and 3 are quite fuzzy, making the 3-cluster solution appear more effective.
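
A sketch of this comparison, again on synthetic data rather than the real features: the points are projected onto two principal components once, and the 3-cluster and 5-cluster labels are computed on the full feature set.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numeric player features.
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=6, random_state=0)
X = StandardScaler().fit_transform(X_raw)

# Two principal components for a 2-D view of the data.
coords = PCA(n_components=2).fit_transform(X)

# Cluster labels for the two candidate solutions.
labels3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
labels5 = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

print(coords.shape, len(set(labels3)), len(set(labels5)))
```

Plotting coords[:, 0] against coords[:, 1], colored once by labels3 and once by labels5, gives the side-by-side view in which overlapping clusters become visible.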
 
 
 
Cluster 0:
  • Higher games played but not team starters
  • Moderate minutes per game, typical of bench players
  • Lower scoring, likely defensive or role players
  • Moderate defensive rebounds but weak offensive rebounds
  • Limited assist numbers, primarily defensive focus
  • Average steals and blocks
  • Good ball control with moderate turnovers
  • Moderate fouls showing defensive pressure
  • Limited assist-to-turnover ratio despite low turnovers
Cluster 1:
  • High games played, usually team starters
  • Longer playing time, carrying significant team responsibility
  • Outstanding scoring ability, team's main scoring option
  • Active in both offensive and defensive rebounds
  • Good playmaking ability with high assists
  • Decent defensive stats but average steals
  • Higher turnovers despite good scoring and assists
  • Higher fouls needing better control
  • Balanced assist-to-turnover ratio but higher turnovers
Cluster 2:
  • Lower games played, likely role players or rotation players
  • Rarely starters with limited playing time
  • Low scoring ability, dependent on other players
  • Limited rebounding contribution
  • Low assist numbers, lacking organizational skills
  • Poor defensive performance with limited steals
  • Almost no shot-blocking ability
  • Low turnovers but limited overall contribution
  • Fewer fouls showing more conservative play
  • Good ball control relative to assist numbers
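
Profiles like the ones above are typically read off a table of per-cluster feature means. A minimal sketch, with made-up numbers and a cluster column assumed to hold the K-Means labels:

```python
import pandas as pd

# Made-up rows; "cluster" stands for the K-Means label of each player-season.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 2, 2],
    "player_points_per_game": [6.0, 8.0, 20.0, 22.0, 2.0, 4.0],
    "player_minutes_per_game": [18.0, 22.0, 34.0, 36.0, 8.0, 10.0],
    "player_games_started": [5, 15, 70, 78, 0, 2],
})

# One row per cluster, one column per feature mean.
profile = df.groupby("cluster").mean().round(2)
print(profile)
```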
 
 

7. Analysis of Factors Affecting Player Scoring

7.1 Visual Analysis

 
Overall, points per game shows minimal correlation with assist-to-turnover ratio and average blocks per game.

7.2 Spearman Correlation Analysis

Spearman correlation analysis shows that a player's average points per game has a significant positive correlation with average playing time, games played, games started, average offensive, defensive, and total rebounds, average assists, average steals, average turnovers, and average fouls. It has a moderate to weak positive correlation with average blocks and assist-to-turnover ratio. This is expected: particularly skilled players earn more playing time and appearances than others, so all of these statistics rise together.
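
For reference, a Spearman correlation against the target column can be computed with pandas as sketched below. The six rows are made up: minutes rise monotonically with points, while blocks do not.

```python
import pandas as pd

# Made-up values; only the column names match the dataset.
df = pd.DataFrame({
    "player_points_per_game": [2.0, 5.0, 9.0, 14.0, 20.0, 27.0],
    "player_minutes_per_game": [6.0, 12.0, 20.0, 26.0, 33.0, 37.0],
    "player_blocks_per_game": [0.4, 0.1, 0.9, 0.2, 0.6, 0.3],
})

# Spearman is rank-based, so it captures any monotonic relationship,
# not only a linear one.
corr = (
    df.corr(method="spearman")["player_points_per_game"]
    .drop("player_points_per_game")
    .sort_values(ascending=False)
)
print(corr.round(3))
```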

7.3 Random Forest Regression Model

The prediction error is around 2-3 points per game, and the R-squared value is acceptable, so the result is satisfactory.
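
A sketch of how such a model can be fit and evaluated with scikit-learn. The data is synthetic and the feature set, split ratio, and hyperparameters are assumptions, so the printed numbers will not match the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: points depend on minutes and assists plus noise.
rng = np.random.default_rng(0)
n = 500
minutes = rng.uniform(5, 38, n)
assists = rng.uniform(0, 8, n)
points = 0.5 * minutes + 0.8 * assists + rng.normal(0, 2, n)
X = np.column_stack([minutes, assists])

X_train, X_test, y_train, y_test = train_test_split(
    X, points, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = float(np.sqrt(mse))  # same units as points per game
r2 = r2_score(y_test, pred)
print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")
```

RMSE is the figure to read as "points of error", since MSE is in squared units.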
 

7.4 XGBoost Regression Model

The XGBoost and Random Forest predictions are very similar.
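
The same evaluation with a gradient-boosted model is sketched below, on the same kind of synthetic data. The hyperparameters are assumptions. If xgboost is not installed, the sketch falls back to scikit-learn's GradientBoostingRegressor, which is a different implementation of gradient boosting and is used here only so the example still runs.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Fallback only: not XGBoost, but a comparable gradient-boosting model.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic stand-in: points depend on minutes and assists plus noise.
rng = np.random.default_rng(0)
n = 500
minutes = rng.uniform(5, 38, n)
assists = rng.uniform(0, 8, n)
points = 0.5 * minutes + 0.8 * assists + rng.normal(0, 2, n)
X = np.column_stack([minutes, assists])

X_train, X_test, y_train, y_test = train_test_split(
    X, points, test_size=0.2, random_state=0
)

model = Booster(n_estimators=200, max_depth=3, learning_rate=0.1, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
r2 = r2_score(y_test, pred)
print(f"RMSE={rmse:.2f}  R2={r2:.2f}")
```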

8. Conclusion

Based on the comprehensive analysis of 1000 NBA player data records, this project has reached the following main conclusions:
  1. Data Cleaning: The player names in the original data were desensitized, but not rigorously. By extracting the names from the URLs, we recovered the real player names and then removed duplicate rows, which supports the accuracy and reliability of the subsequent analysis.
  2. Descriptive Analysis: The data covers the period from 1990 to 2024, with relatively few early records. From 2005 onward the yearly counts are stable, generally more than 25 records per year. Overall, most players' performance is relatively mediocre, so the distributions are clearly right-skewed: most values are low, with a long tail of high performers.
  3. Cluster Analysis: Using the K-Means clustering method, player data was divided into three categories:
      • Cluster 0: These players show balanced performance. Although their scoring is low, they control turnovers well and are suited to backup or role-player positions.
      • Cluster 1: These players show the most outstanding performance, with high points, assists, and rebounds, and form the core of their teams.
      • Cluster 2: These players have low participation and weak scoring and other statistics; they may be marginal players or players recovering from injuries.
  4. Analysis of Factors Affecting Player Scoring: Players' average points per game are significantly positively correlated with average playing time, games played, games started, average offensive rebounds, average defensive rebounds, average total rebounds, average assists, average steals, average turnovers, and average fouls. Scoring also shows a moderate to weak positive correlation with average blocks and assist-to-turnover ratio.
  5. Machine Learning: Using the selected features, Random Forest Regression and XGBoost Regression models were built. Both models have an error of about 5.2 and an R-squared of 0.84. If the 5.2 is read as mean squared error, it corresponds to a root-mean-square error of about 2.3 points, which matches the 2-3 point error noted in Section 7.3. Both models perform well and predict a player's scoring in a given season with reasonable accuracy.
 
 

© Kai Zhang 2024-2025