Baseball Metrics: Correlating Runs Scored

For this analysis, we will evaluate 1) Batting Average, 2) On-Base Percentage, 3) Slugging, and 4) On-Base Plus Slugging, and determine which metric correlates best with runs scored.

First, let's define each metric and how it is calculated.

Batting Average (BA) is hits divided by number of at-bats. An average over .300 is considered superb, and generally, the season’s batting average leader will hit around .340. The range for BA is 0.0 to 1.0.

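On-Base Percentage (OBP) measures how often a batter reaches base. It is hits plus walks plus hit by pitches, divided by at-bats plus walks plus hit by pitches plus sacrifice flies. Reaching base on a fielding error does not count toward OBP. The range for OBP is 0.0 to 1.0. An OBP of .340 is considered good.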

Slugging (SLG) is similar to batting average, but it takes the total bases divided by the number of at-bats. Unlike batting average, slugging gives more weight to extra-base hits, such as doubles and home runs, relative to singles. It is measured on a scale of 0 to 4. A slugging percentage of .450 is considered good, meaning the player averages almost 1/2 base for every at-bat.

On-Base Plus Slugging (OPS) is exactly what the name says: on-base percentage plus slugging. It is measured on a scale of 0 to 5. It represents a player’s ability to get on base and hit for power. An OPS of .900 is considered excellent.
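To make the four definitions concrete, here is a minimal sketch that computes them from raw counting stats. The function name batting_metrics and the stat line are made up for illustration, and OBP uses the standard formula (H + BB + HBP) / (AB + BB + HBP + SF).

```python
def batting_metrics(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Return BA, OBP, SLG, and OPS from a hitter's (or team's) counting stats."""
    singles = h - doubles - triples - hr
    # Total bases weight each hit by the number of bases it is worth
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    ba = h / ab
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    return {"BA": ba, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Hypothetical stat line, not a real player
line = batting_metrics(ab=500, h=150, doubles=30, triples=5, hr=20, bb=60, hbp=5, sf=5)
for name, value in line.items():
    print(f"{name}: {value:.3f}")
# BA: 0.300
# OBP: 0.377
# SLG: 0.500
# OPS: 0.877
```

Note that walks and hit by pitches raise OBP but leave BA and SLG untouched, because neither counts as an at-bat.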

So, which metric best correlates to runs scored?

Let’s find out!

For this analysis, I downloaded the 2019 Major League Baseball team stats from Baseball-Reference.com.

I will use Python to import the data, graph it, and then calculate Pearson’s correlation.

R² (the coefficient of determination) is used to evaluate the quality of fit of a model on data. It expresses what fraction of the variability of your dependent variable (Y) is explained by your independent variable (X). For a straight-line fit with a single X, R² is simply Pearson’s correlation coefficient r, squared.
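As a quick check on how the two numbers relate, the sketch below computes Pearson’s r and the R-squared of a least-squares line by hand, with no libraries, and shows that r squared matches the fit’s R-squared. The helper names and the six data points are invented for illustration and are not real team data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared(x, y):
    """R-squared of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    # Unexplained variation (residuals) versus total variation in y
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
obp = [0.300, 0.310, 0.320, 0.330, 0.340, 0.350]
runs = [650, 700, 690, 760, 800, 810]

r = pearson_r(obp, runs)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}, R-squared of fit = {r_squared(obp, runs):.3f}")
```

The last two printed values agree, which is why squaring the r value from a simple linear regression gives the fraction of variance explained.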

MLB Stats, Scores, History, & Records | Baseball-Reference.com

Batting Average

On-Base Percentage

Slugging

On-Base Plus Slugging

Conclusion

On-Base Percentage, Slugging, and On-Base Plus Slugging all correlate well with runs scored, with R² values between .88 and .94.

Batting average has the weakest correlation, with an R² of .60.

Metric	R²
Batting Average	.60
On-Base Percentage	.88
Slugging	.91
On-Base Plus Slugging	.94

Here is the code in full if you want to copy and paste. The team stats file is also available to download.

TeamStats2019Download

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load the team stats and keep only the columns used below
df = pd.read_csv('TeamStats2019.csv')
df2 = df[['Team','Year','BA','OBP','SLG','OPS','R']]
df2 = df2.rename(columns = {"R":"Runs"})

# Batting Average
# linregress returns Pearson's r as r_value; squaring it gives R-squared
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.BA, df2.Runs)
plt.title('Runs Scored by Batting Average (BA)')
sns.regplot(x=df2.BA, y=df2.Runs)
plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
plt.savefig('mlb-correlating-runs-scored-batting-average.png', dpi=300, bbox_inches='tight')
plt.show()

# On-Base Percentage
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OBP, df2.Runs)
plt.title('Runs Scored by On-Base Percentage (OBP)')
sns.regplot(x=df2.OBP, y=df2.Runs)
plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
plt.savefig('mlb-correlating-runs-scored-on-base-percentage.png', dpi=300, bbox_inches='tight')
plt.show()

# Slugging
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.SLG, df2.Runs)
plt.title('Runs Scored by Slugging (SLG)')
sns.regplot(x=df2.SLG, y=df2.Runs)
plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
plt.savefig('mlb-correlating-runs-scored-slugging.png', dpi=300, bbox_inches='tight')
plt.show()

# On-Base Plus Slugging
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OPS, df2.Runs)
plt.title('Runs Scored by On-Base Plus Slugging (OPS)')
sns.regplot(x=df2.OPS, y=df2.Runs)
plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
plt.savefig('mlb-correlating-runs-scored-on-base-plus-slugging.png', dpi=300, bbox_inches='tight')
plt.show()
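
If you only want the numbers and not the plots, the summary table can probably be produced in one step. This is a sketch rather than part of the original script: the helper r_squared_by_metric is a made-up name, and the small DataFrame is invented so the snippet runs without the CSV file. It relies on pandas’ DataFrame.corrwith, which computes Pearson’s correlation of each column against a Series.

```python
import pandas as pd


def r_squared_by_metric(df, metrics, target="Runs"):
    """Square of Pearson's r between each metric column and the target column."""
    return (df[metrics].corrwith(df[target]) ** 2).round(2)


# Small made-up frame standing in for the real team stats
demo = pd.DataFrame({
    "BA": [0.240, 0.250, 0.255, 0.262, 0.270],
    "OPS": [0.700, 0.730, 0.750, 0.780, 0.820],
    "Runs": [650, 700, 720, 770, 850],
})

table = r_squared_by_metric(demo, ["BA", "OPS"])
print(table)
```

With the real data loaded as df2, the call would be r_squared_by_metric(df2, ["BA", "OBP", "SLG", "OPS"]).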