NBA Player Stats Forecast

1. Project Background

As NBA games grow more intense and attract more global attention, players' seasonal performance data has become an important basis for fans, analysts, and team decision-makers to evaluate player ability and game trends. Using per-season NBA player game data, this project conducts a comprehensive analysis to reveal patterns and trends in player performance, explore the characteristics of different player types, and provide data support for team personnel selection, tactical deployment, and future season predictions.

Dataset download link: https://www.heywhale.com/mw/dataset/674416242f5b14a07a1fda23

2. Data Description

Field	Description
URL	Player statistics page URL
player_name	Player name
player_games_played	Number of games played
player_games_started	Number of games started
player_minutes_per_game	Average minutes played per game
player_points_per_game	Average points scored per game
player_offensive_rebounds_per_game	Average offensive rebounds per game
player_defensive_rebounds_per_game	Average defensive rebounds per game
player_rebounds_per_game	Average total rebounds per game
player_assists_per_game	Average assists per game
player_steals_per_game	Average steals per game
player_blocks_per_game	Average blocks per game
player_turnovers_per_game	Average turnovers per game
player_fouls_per_game	Average fouls per game
player_assist_to_turnover_ratio	Assist to turnover ratio
team	Team name
season_type	Season type (regular season, playoffs, etc.)
season_year	Season year
timestamp	Data timestamp

3. Python Library Import and Data Reading

4. Data Preview and Preprocessing

player_name contains duplicate values because the same player appears in several seasons. The names were also desensitized, with characters replaced by "█". That is normal for a public dataset, but the desensitization wasn't done well: some names can still be deduced from the remaining characters, and each URL contains the player's name, for example: https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY.

We can directly see this player is Chris Duhon, so we can extract players' real names from the URLs.
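
The extraction step described above can be sketched as follows. This is a minimal sketch, not the original notebook code: the helper name_from_espn_url is hypothetical, and it assumes every URL ends its path with a lowercase, hyphen-separated name slug as in the example URL.

```python
from urllib.parse import urlparse


def name_from_espn_url(url: str) -> str:
    """Return a capitalized player name taken from the last path segment of the URL.

    Hypothetical helper: assumes the path ends with a slug such as "chris-duhon".
    """
    # urlparse separates the path from the query string (?year=...&team=...)
    slug = urlparse(url).path.rstrip("/").split("/")[-1]
    # Turn "chris-duhon" into "Chris Duhon"
    return " ".join(part.capitalize() for part in slug.split("-"))


example_url = "https://www.espn.com/nba/player/_/id/2377/chris-duhon?year=2008-09&team=NY"
print(name_from_espn_url(example_url))
```

Simple capitalization will not restore apostrophes or inner capitals that a slug has lost, so the result is best treated as a readable identifier rather than an official spelling.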

OK, looks like the processing is complete. We now have the real player names displayed. Let's check if there are any features with very few categories that might contain useless information.

In the season_type column, 80.9% of the values are "Not clear", so the column says almost nothing about the season type. The timestamp column is equally uninformative, with the single value 45606 accounting for a large proportion of rows. We therefore remove these two features along with URL.
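
A check like the one above could look roughly like this with pandas. The frame df_demo is a toy stand-in for the real dataset, and the "Regular Season" label and the numbers in it are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for the real dataset (all values are made up)
df_demo = pd.DataFrame({
    "URL": ["u1", "u2", "u3", "u4", "u5"],
    "season_type": ["Not clear"] * 4 + ["Regular Season"],
    "timestamp": [45606, 45606, 45606, 45606, 45607],
    "player_points_per_game": [10.1, 5.2, 20.3, 7.7, 12.0],
})

# Share of each category: a column dominated by one value carries little information
for col in ["season_type", "timestamp"]:
    print(df_demo[col].value_counts(normalize=True))

# Drop the uninformative columns together with the URL
df_reduced = df_demo.drop(columns=["URL", "season_type", "timestamp"])
print(df_reduced.columns.tolist())
```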

We also need to process season_year. Initially, I thought the format was year-month, but it is actually start year-end year: 2004-05 denotes the season that starts in 2004 and ends in 2005. Let's check whether any value spans more than 1 year.

After confirming there are no issues, we'll keep only the starting year data.
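
One way to run the span check and keep the starting year is sketched below. The helpers season_span and start_year are hypothetical names, and the sketch assumes every label has the form YYYY-YY.

```python
import re


def season_span(season: str) -> int:
    """Number of years covered by a label such as '2004-05' (1 for a normal season)."""
    match = re.fullmatch(r"(\d{4})-(\d{2})", season)
    if match is None:
        raise ValueError(f"unexpected season label: {season!r}")
    start = int(match.group(1))
    end_two_digits = int(match.group(2))
    # Modulo 100 handles the century rollover, e.g. '1999-00'
    return (end_two_digits - start) % 100


def start_year(season: str) -> int:
    """Keep only the starting year of the season label."""
    return int(season[:4])


for label in ["2004-05", "1999-00", "2008-09"]:
    print(label, season_span(label), start_year(label))
```

Any label whose span is not 1 would need a closer look before it is reduced to its starting year.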

After this processing, many rows are exact duplicates even though they came from separate URLs. They describe the same player in the same season with identical metrics, so they need to be removed.
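
A minimal pandas sketch of this deduplication step, using an invented three-row frame (the player label and numbers are placeholders, not values from the dataset):

```python
import pandas as pd

# Placeholder rows: the first two are identical in every column
df_demo = pd.DataFrame({
    "player_name": ["Player A", "Player A", "Player A"],
    "season_year": [2008, 2008, 2009],
    "player_points_per_game": [11.1, 11.1, 7.4],
})

# Count fully identical rows, then keep the first occurrence of each
n_dupes = int(df_demo.duplicated().sum())
df_unique = df_demo.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(df_unique))
```

Because drop_duplicates compares all columns by default, a player who appears in two different seasons is kept twice, which is the intended behavior here.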

The data will contain outliers, but these most likely reflect genuinely outstanding players rather than errors, since the data comes from the globally renowned sports news and information website espn.com. We therefore leave the outliers untouched for now and begin the analysis.

5. Descriptive Analysis

Overall, a player's points per game shows minimal correlation with assist-to-turnover ratio and average blocks per game.

6. Cluster Analysis

Here we can choose either 3 or 5 clusters, as the elbow plot clearly flattens at 3 and 5, and these points also show relatively high silhouette scores. Let's try PCA dimensionality reduction to see which works better between 3 and 5 clusters.
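
The elbow and silhouette comparison can be sketched with scikit-learn as below. The data here is synthetic (three well-separated blobs), and the range of k, the scaling step, and the random seeds are assumptions rather than the settings used in the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player features: three blobs in four dimensions
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 5, 5, 5], [10, 0, 10, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 4)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

inertias, silhouettes = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_  # within-cluster sum of squares (elbow plot y-axis)
    silhouettes[k] = silhouette_score(X_scaled, model.labels_)

# The k with the highest silhouette score is one candidate; the elbow is read off the inertias
best_k = max(silhouettes, key=silhouettes.get)
print(best_k)
```

On real data the two criteria often disagree or suggest several candidates, which is why a visual check is still useful.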

It seems that with 5 clusters, the boundaries between clusters 2 and 3 are quite fuzzy, making the 3-cluster solution appear more effective.
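
The comparison between 3 and 5 clusters could be reproduced along these lines. This is a sketch on random synthetic data, so it shows the mechanics only; the feature count, seeds, and variable names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled player features
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
X_scaled = StandardScaler().fit_transform(X)

# Project to two principal components for plotting
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X_scaled)
explained = float(pca.explained_variance_ratio_.sum())

# Cluster in the full feature space, once per candidate k
labels_by_k = {
    k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    for k in (3, 5)
}
print(coords.shape, round(explained, 3))
```

Scatter-plotting coords colored by labels_by_k[3] and labels_by_k[5] gives the two pictures to compare. The 2D projection only keeps the share of variance in explained, so overlap in the plot does not prove overlap in the full feature space.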

Cluster 0:

Higher games played but not team starters

Moderate minutes per game, typical of bench players

Lower scoring, likely defensive or role players

Moderate defensive rebounds but weak offensive rebounds

Limited assist numbers, primarily defensive focus

Average steals and blocks

Good ball control with moderate turnovers

Moderate fouls showing defensive pressure

Limited assist-to-turnover ratio despite low turnovers

Cluster 1:

High games played, usually team starters

Longer playing time, carrying significant team responsibility

Outstanding scoring ability, team's main scoring option

Active in both offensive and defensive rebounds

Good playmaking ability with high assists

Decent defensive stats but average steals

Higher turnovers despite good scoring and assists

Higher fouls needing better control

Balanced assist-to-turnover ratio but higher turnovers

Cluster 2:

Lower games played, likely role players or rotation players

Rarely starters with limited playing time

Low scoring ability, dependent on other players

Limited rebounding contribution

Low assist numbers, lacking organizational skills

Poor defensive performance with limited steals

Almost no shot-blocking ability

Low turnovers but limited overall contribution

Fewer fouls showing more conservative play

Good ball control relative to assist numbers

7. Analysis of Factors Affecting Player Scoring

7.1 Visual Analysis

Overall, the plots show that points per game has little visible relationship with assist-to-turnover ratio or average blocks per game.

7.2 Spearman Correlation Analysis

Spearman correlation analysis shows that a player's average points per game is significantly positively correlated with average playing time, games played, games started, average offensive rebounds, average defensive rebounds, average total rebounds, average assists, average steals, average turnovers, and average fouls. It shows moderate to weak positive correlation with average blocks and assist-to-turnover ratio. This is expected: a particularly skilled player naturally gets more playing time and more appearances, and most of these statistics rise together with time on the court.
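
A Spearman correlation matrix can be computed directly with pandas, as in the sketch below. The five rows are invented toy values chosen so the result is easy to verify by hand; they are not taken from the dataset.

```python
import pandas as pd

# Toy values: points rise strictly with minutes, blocks are unrelated
df_demo = pd.DataFrame({
    "player_minutes_per_game": [10, 18, 25, 30, 36],
    "player_points_per_game": [3.0, 6.5, 9.0, 15.0, 24.0],
    "player_blocks_per_game": [0.5, 0.1, 0.9, 0.3, 0.4],
})

# Spearman works on ranks, so it captures monotonic rather than strictly linear relationships
spearman = df_demo.corr(method="spearman")
rho_with_points = spearman["player_points_per_game"].drop("player_points_per_game")
print(rho_with_points)
```

In this toy frame the coefficient is 1.0 for minutes, because the ranks match exactly, and -0.1 for blocks.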

7.3 Random Forest Regression Model

The prediction error is around 2-3 points per game, and the R-squared value is acceptable, so the result is satisfactory.
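
For readers who want to reproduce this kind of evaluation, a minimal scikit-learn sketch follows. The features, the synthetic relationship between them and points, and the hyperparameters are all assumptions, so the printed numbers will not match the results reported here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: points depend on minutes and assists plus noise
rng = np.random.default_rng(42)
n = 500
minutes = rng.uniform(5, 40, n)
assists = rng.uniform(0, 10, n)
points = 0.5 * minutes + 0.8 * assists + rng.normal(scale=2.0, size=n)

X = np.column_stack([minutes, assists])
X_train, X_test, y_train, y_test = train_test_split(
    X, points, test_size=0.2, random_state=42
)

model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)  # squared units
rmse = float(np.sqrt(mse))              # same units as points per game
r2 = r2_score(y_test, pred)
print(round(mse, 2), round(rmse, 2), round(r2, 2))
```

Reporting RMSE next to MSE makes the error easier to read, because RMSE is expressed in points per game.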

7.4 XGBoost Regression Model

The XGBoost and Random Forest predictions are very similar.

8. Conclusion

Based on the comprehensive analysis of 1000 NBA player data records, this project has reached the following main conclusions:

Data Cleaning: The original data had player names that were desensitized, but the processing was not rigorous enough. By extracting URL information, we successfully obtained players' real names and removed duplicate data, ensuring the accuracy and reliability of subsequent analysis.

Descriptive Analysis: The data covers the period from 1990 to 2024, with earlier records being relatively scarce. Since 2005, the yearly volume has stabilized, with generally more than 25 entries collected annually. Overall, most players' performance is relatively mediocre: the distributions are clearly right-skewed, with most values concentrated at the low end and a long tail of standout players.

Cluster Analysis: Using K-Means clustering method, player data was divided into three categories:

Cluster 0: These players show balanced performance. Although scoring is low, they have good turnover control and are suitable for backup roles or role players in the team.

Cluster 1: These players show the most outstanding performance, with high scores, assists, and rebounds, forming the core of their teams.

Cluster 2: These players have low participation and weak scoring and other statistics, and may be marginal players or players recovering from injuries.

Analysis of Factors Affecting Player Scoring: Analysis shows that players' average points per game are significantly positively correlated with: average playing time, games played, games started, average offensive rebounds, average defensive rebounds, average total rebounds, average assists, average steals, average turnovers, and average fouls. Additionally, scoring shows moderate to weak positive correlation with average blocks and assist-to-turnover ratio.

Machine Learning: By selecting relevant features, Random Forest Regression and XGBoost Regression models were built. Both models reach R-squared values of about 0.84. The error figure of about 5.2 is presumably a mean squared error, which corresponds to a typical error of roughly 2-3 points per game, as noted in section 7.3. Both models perform well and can predict players' scoring in a given season with reasonable accuracy.