Baseball Metrics: Correlating Runs Scored

For this analysis we are going to evaluate 1) Batting Average, 2) On-Base Percentage, 3) Slugging, and 4) On-Base Plus Slugging and determine which metric best correlates to runs scored.

First, it is important to note the definition of each metric and how it is calculated.

Batting Average (BA) is hits divided by number of at-bats. An average over .300 is considered superb, and generally the season’s batting average leader will hit around .340 for the season. The range for BA is 0.0 to 1.0.

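On-Base Percentage (OBP) measures how often a batter reaches base. Like batting average it counts hits, but it also counts walks and hit by pitches. Reaching base on a fielding error does not count toward OBP. It is calculated as hits plus walks plus hit by pitches, divided by at-bats plus walks plus hit by pitches plus sacrifice flies. The range for OBP is 0.0 to 1.0. An OBP of .340 is considered good.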
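To make the difference between the two rates concrete, here is a minimal Python sketch. It assumes the standard formulas BA = H / AB and OBP = (H + BB + HBP) / (AB + BB + HBP + SF). The function names and the season line are made up for illustration and are not part of the 2019 data set.

```python
def batting_average(hits, at_bats):
    """BA = hits / at-bats."""
    return hits / at_bats if at_bats else 0.0


def on_base_percentage(hits, walks, hbp, at_bats, sac_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = at_bats + walks + hbp + sac_flies
    return (hits + walks + hbp) / denominator if denominator else 0.0


# Hypothetical season line, invented for illustration only
ba = batting_average(hits=150, at_bats=500)
obp = on_base_percentage(hits=150, walks=60, hbp=5, at_bats=500, sac_flies=5)

print(f"BA:  {ba:.3f}")
print(f"OBP: {obp:.3f}")
```

Because walks and hit by pitches add to both the numerator and the denominator, a patient hitter can post an OBP well above their batting average.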

Slugging (SLG) is similar to batting average, but it is total bases divided by the number of at-bats. Unlike batting average, slugging gives more weight to extra-base hits such as doubles and home runs, relative to singles. It is measured on a scale of 0 to 4. A slugging percentage of .450 is considered good, meaning that for every at-bat the player is averaging almost 1/2 base.

On-Base Plus Slugging (OPS) is exactly what the name says: on-base percentage plus slugging. It is measured on a scale of 0 to 5. It represents a player’s ability to get on base and hit for power. An OPS of .900 is considered great.
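The slugging and OPS definitions above can be sketched the same way. This is an illustrative example only: the function names and the hit breakdown are hypothetical, and the OBP value passed in is simply a made-up input.

```python
def slugging(singles, doubles, triples, home_runs, at_bats):
    """SLG = total bases / at-bats; each hit counts for the bases it earned."""
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats if at_bats else 0.0


def on_base_plus_slugging(obp, slg):
    """OPS = OBP + SLG."""
    return obp + slg


# Hypothetical season line, invented for illustration only
slg = slugging(singles=100, doubles=30, triples=2, home_runs=18, at_bats=500)
ops = on_base_plus_slugging(obp=0.377, slg=slg)

# Upper bounds: a home run in every at-bat gives SLG 4.0,
# and with a perfect OBP of 1.0 the OPS ceiling is 5.0.
max_slg = slugging(singles=0, doubles=0, triples=0, home_runs=10, at_bats=10)
max_ops = on_base_plus_slugging(obp=1.0, slg=max_slg)

print(f"SLG: {slg:.3f}")
print(f"OPS: {ops:.3f}")
print(f"Max SLG: {max_slg:.1f}, max OPS: {max_ops:.1f}")
```

The last two values show where the 0 to 4 and 0 to 5 scales come from.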

So which metric best correlates to runs scored?

Let’s find out!


For this analysis I downloaded the 2019 Major League Baseball team stats from the Baseball-Reference website.

I will be using Python to import the data, graph the data, and calculate Pearson’s correlation coefficient (r) between each metric and runs scored. Squaring r gives R2, which is used to evaluate the quality of fit of a model on data. It expresses what fraction of the variability of your dependent variable (Y) is explained by your independent variable (X).
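As a quick check on how Pearson’s r and R2 relate, the sketch below computes both by hand for a simple straight-line fit. It is a dependency-free illustration, not the code used in this analysis: the helper names are hypothetical and the (OBP, runs) pairs are made up.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def r_squared_from_fit(xs, ys):
    """R2 = 1 - SS_res / SS_tot for the least-squares line y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Made-up (OBP, runs) pairs, for illustration only
obp_values = [0.300, 0.310, 0.320, 0.330, 0.340, 0.350]
runs_values = [650, 700, 720, 790, 800, 860]

r = pearson_r(obp_values, runs_values)
print("r squared:       ", r ** 2)
print("R2 from the fit: ", r_squared_from_fit(obp_values, runs_values))
```

For a regression with a single predictor the two numbers agree up to floating-point rounding, which is why the snippets in this post can report R2 as r_value**2.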



import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

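# Load the 2019 team stats downloaded from Baseball-Reference
df = pd.read_csv('TeamStats2019.csv')
# Keep only the columns used in this analysis and give runs a clearer name
df2 = df[['Team','Year','BA','OBP','SLG','OPS','R']]
df2 = df2.rename(columns = {"R":"Runs"})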

Batting Average

slope, intercept, r_value, p_value, std_err = stats.linregress(df2.BA,df2.Runs)
plt.title('Runs Scored by Batting Average (BA)')
sns.regplot(x = df2.BA, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)

On-Base Percentage

slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OBP,df2.Runs)
plt.title('Runs Scored by On-Base Percentage (OBP)')
sns.regplot(x = df2.OBP, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)

Slugging

slope, intercept, r_value, p_value, std_err = stats.linregress(df2.SLG,df2.Runs)
plt.title('Runs Scored by Slugging (SLG)')
sns.regplot(x = df2.SLG, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)

On-Base Plus Slugging

slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OPS,df2.Runs)
plt.title('Runs Scored by On-Base Plus Slugging (OPS)')
sns.regplot(x = df2.OPS, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)

Conclusion

On-Base Percentage, Slugging, and On-Base Plus Slugging all correlate well with runs scored, with R2 values between .878 and .937. Batting Average has the weakest correlation, with an R2 of .597.

Metric                    R2
Batting Average           .597
On-Base Percentage        .878
Slugging                  .914
On-Base Plus Slugging     .937

Here is the code in full if you want to copy and paste.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats

df = pd.read_csv('TeamStats2019.csv')
df2 = df[['Team','Year','BA','OBP','SLG','OPS','R']]
df2 = df2.rename(columns = {"R":"Runs"})

#Batting Average
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.BA,df2.Runs)
plt.title('Runs Scored by Batting Average (BA)')
sns.regplot(x = df2.BA, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)


#On-Base Percentage
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OBP,df2.Runs)
plt.title('Runs Scored by On-Base Percentage (OBP)')
sns.regplot(x = df2.OBP, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)


#Slugging
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.SLG,df2.Runs)
plt.title('Runs Scored by Slugging (SLG)')
sns.regplot(x = df2.SLG, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)

#On-Base Plus Slugging
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OPS,df2.Runs)
plt.title('Runs Scored by On-Base Plus Slugging (OPS)')
sns.regplot(x = df2.OPS, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)