Simulating a Basketball Game - Advanced SQL Puzzles

In this study, we simulate two distinct scenarios involving basketball teams: one team exclusively shoots three-pointers, while the other focuses solely on two-pointers. We then visualize the data in a graph and evaluate which team exhibits a higher win percentage. Finally, we conduct a two-sample t-test to ascertain whether the two teams’ average win percentages are statistically different.

Using average NBA shooting statistics from the 2022–2023 season, the three-point shooting success rate is 36.1%, and the two-point shooting success rate is 47.5%. The average number of shot attempts per game is 90. These figures will be used for my analysis.

ℹ️ Note: The Python script for this simulation is located at the end of this analysis.

First, we will run a simulation of 10,000 games and determine which team has the higher win percentage.

Based on the results, the team that only takes three-point shots wins 75% of the time.

After 10,000 simulations:
Team shooting only three-pointers won 7555 times
Team shooting only two-pointers won 2258 times
Tied result: 187 times

The data allows us to plot a histogram that illustrates the two teams have different means and standard deviations.

Here is a look at the summary statistics. It should be noted that the average points per game for the 2022-2023 season is 114.7, and both of our simulation averages fall below this mark, which is likely due to not including free-throw attempts in my analysis.

Average points per game for the team shooting three-pointers: 108.18
Standard deviation for the team shooting three-pointers: 14.56
Average points per game for the team shooting two-pointers: 94.99
Standard deviation for the team shooting two-pointers: 9.94

Running a two-sample t-test, we can see the two-tailed P value is less than 0.0001. We reject the null hypothesis and state the two sample sets are statistically significant.

T-statistic: 74.80611223225225
P-value: 0.0
We reject the null hypothesis: there is a significant difference in the means of the two populations.

Conclusion

In the last two decades, three-point attempts have more than doubled, increasing from 14 to 35 attempts per game. This analysis demonstrates that teams with superior three-point shooting skills have an advantage, thereby validating the strategy of prioritizing three-pointers in gameplay.

Here is the Python code I used for this analysis.

# https://www.basketball-reference.com/leagues/NBA_stats_per_game.html

import matplotlib.pyplot as plt
import random
from scipy.stats import ttest_ind
import numpy as np

# Define the number of shots to take and the shooting percentages for each team
num_shots = 100
three_pct = 0.361
two_pct = 0.475

# Simulate shooting 90 shots for each team 100,000 times and keep track of how many times each team wins
num_games = 10000
team_three_wins = 0
team_two_wins = 0
tied_result = 0
for i in range(num_games):
    team_three_score = 0
    team_two_score = 0
    for j in range(num_shots):
        if random.random() <= three_pct:
            team_three_score += 3
        if random.random() <= two_pct:
            team_two_score += 2
    if team_three_score > team_two_score:
        team_three_wins += 1
    elif team_two_score > team_three_score:
        team_two_wins += 1
    elif team_two_score == team_three_score:
        tied_result += 1

# Print the results
print("After {} simulations:".format(num_games))
print("Team shooting only three-pointers won {} times".format(team_three_wins))
print("Team shooting only two-pointers won {} times".format(team_two_wins))
print("Tied result: {} times".format(tied_result))

# Function to simulate a game with only three-point shots
def simulate_game_three():
    made_shots_three = 0
    for _ in range(num_shots):
        if random.random() < three_pct:
            made_shots_three += 1
    return made_shots_three * 3

# Function to simulate a game with only three-point shots
def simulate_game_two():
    made_shots_two = 0
    for _ in range(num_shots):
        if random.random() < two_pct:
            made_shots_two += 1
    return made_shots_two * 2

# Run the simulation and record the total points scored in each game
game_scores_three = [simulate_game_three() for _ in range(num_games)]
game_scores_two = [simulate_game_two() for _ in range(num_games)]


# Create a histogram of the total points scored
bins_three = range(min(game_scores_three), max(game_scores_three)+1, 3)
bins_two = range(min(game_scores_two), max(game_scores_two)+1, 2)
plt.hist(game_scores_three, bins=bins_three, color='tab:blue', alpha=0.5, label="Three-point shots", edgecolor='black')
plt.hist(game_scores_two, bins=bins_two, color='tab:orange', alpha=0.5, label="Two-point shots", edgecolor='black')
plt.xlabel("Total Points Scored")
plt.ylabel("Frequency")
plt.title("Histogram Comparison of Total Points Scored")
plt.legend()
plt.savefig('basketball-simulation-histogram.png', dpi=300, bbox_inches='tight')
plt.show()

# Perform the two-sample t-test
t_statistic, p_value = ttest_ind(game_scores_three, game_scores_two)

# Print the results
print("T-statistic:", t_statistic)
print("P-value:", p_value)

mean_three = np.mean(game_scores_three)
mean_two = np.mean(game_scores_two)
std_dev_three = np.std(game_scores_three, ddof=1)  # Using ddof=1 for sample standard deviation
std_dev_two = np.std(game_scores_two, ddof=1)

# Print the mean and standard deviation of each dataset
print("Average points per game for the team shooting three-pointers: {:.2f}".format(mean_three))
print("Standard deviation for the team shooting three-pointers: {:.2f}".format(std_dev_three))
print("Average points per game for the team shooting two-pointers: {:.2f}".format(mean_two))
print("Standard deviation for the team shooting two-pointers: {:.2f}".format(std_dev_two))

# Determine if the null hypothesis can be rejected
alpha = 0.05  # Set a significance level
if p_value < alpha:
    print("We reject the null hypothesis: there is a significant difference in the means of the two populations.")
else:
    print("We cannot reject the null hypothesis: there is no significant difference in the means of the two populations.")