For this analysis we are going to evaluate 1) Batting Average, 2) On-Base Percentage, 3) Slugging, and 4) On-Base Plus Slugging, and determine which metric correlates best with runs scored.

First, it is important to note the definition of each metric and how it is calculated.

**Batting Average (BA)** is hits divided by at-bats. An average over .300 is considered superb, and the season’s batting average leader will generally hit around .340. The range for BA is 0.0 to 1.0.
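As a quick illustration of the formula, here is a minimal Python sketch. The function name `batting_average` and the stat line (170 hits in 500 at-bats) are made up for the example and are not taken from the 2019 data.

```python
def batting_average(hits: int, at_bats: int) -> float:
    """Hits divided by at-bats; returns 0.0 when there are no at-bats."""
    if at_bats == 0:
        return 0.0
    return hits / at_bats


# Hypothetical stat line: 170 hits in 500 at-bats.
ba = batting_average(170, 500)
print(f"{ba:.3f}")  # 0.340
```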

**On-Base Percentage (OBP)** measures how often a batter reaches base. It builds on batting average by also counting walks and hit-by-pitches; reaching base on a fielding error does not count toward OBP. The range for OBP is 0.0 to 1.0. An OBP of .340 is considered good.
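The standard OBP formula is (H + BB + HBP) / (AB + BB + HBP + SF), where SF is sacrifice flies. A minimal sketch of that calculation follows; the function name and the stat line are hypothetical and chosen only to show the arithmetic.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sac_flies):
    """(H + BB + HBP) / (AB + BB + HBP + SF); 0.0 if the denominator is zero."""
    denominator = at_bats + walks + hit_by_pitch + sac_flies
    if denominator == 0:
        return 0.0
    return (hits + walks + hit_by_pitch) / denominator


# Hypothetical stat line, not from the 2019 data set.
obp = on_base_percentage(
    hits=150, walks=60, hit_by_pitch=5, at_bats=550, sac_flies=5
)
print(f"{obp:.3f}")  # 0.347
```

Because walks and hit-by-pitches appear in both the numerator and the denominator, a patient hitter can post a good OBP even with a modest batting average.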

**Slugging (SLG)** is similar to batting average, but it divides total bases by at-bats. Unlike batting average, slugging gives more weight to extra-base hits such as doubles and home runs, relative to singles. It is measured on a scale of 0 to 4. A slugging percentage of .450 is considered good, meaning the player averages almost half a base per at-bat.
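To make the weighting concrete, the sketch below computes total bases (1 for a single, 2 for a double, 3 for a triple, 4 for a home run) and compares slugging with batting average for the same hitter. The function name and the stat line are assumptions made for illustration.

```python
def slugging(singles, doubles, triples, home_runs, at_bats):
    """Total bases divided by at-bats; returns 0.0 when there are no at-bats."""
    if at_bats == 0:
        return 0.0
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats


# Hypothetical stat line: 160 hits in 550 at-bats.
singles, doubles, triples, home_runs, at_bats = 100, 30, 5, 25, 550

hits = singles + doubles + triples + home_runs
slg = slugging(singles, doubles, triples, home_runs, at_bats)
print(f"BA:  {hits / at_bats:.3f}")  # BA:  0.291
print(f"SLG: {slg:.3f}")             # SLG: 0.500
```

The same 160 hits produce a .291 average but a .500 slugging percentage, because the 60 extra-base hits count for more than one base each.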

**On-Base Plus Slugging (OPS)** is exactly what the name says: on-base percentage plus slugging. It is measured on a scale of 0 to 5. It represents a player’s ability to get on base and hit for power. An OPS of .900 is considered great.
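The 0 to 5 scale follows directly from the two components: OBP tops out at 1.0 and SLG at 4.0. A short sketch, using a hypothetical helper named `ops` and the "good" benchmark values quoted above:

```python
def ops(obp: float, slg: float) -> float:
    """On-base percentage plus slugging percentage."""
    return obp + slg


# A hitter with a "good" OBP (.340) and a "good" SLG (.450).
print(f"{ops(0.340, 0.450):.3f}")  # 0.790

# Theoretical ceiling: a home run in every plate appearance.
print(f"{ops(1.0, 4.0):.3f}")  # 5.000
```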

#### So which metric best correlates to runs scored?

Let’s find out!

For this analysis I downloaded the **2019 Major League Baseball Team** stats from the Baseball-Reference website.

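I will be using Python to import the data, graph the data, and calculate Pearson’s correlation between each metric and runs scored, reported here as R².

R² is used to evaluate the quality of fit of a model on data. It expresses what fraction of the variability of your dependent variable (Y, here runs scored) is explained by your independent variable (X, here the batting metric). For a simple linear regression with a single predictor, R² equals the square of Pearson’s correlation coefficient r.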
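For a straight-line fit with one predictor, the "fraction of variability explained" and the square of Pearson's r are the same number. The sketch below checks this in plain Python. The helper names and the six data points are invented for illustration and are not the 2019 team stats.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def r_squared_from_fit(x, y):
    """1 - SS_res / SS_tot for the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up team OBP values and run totals.
obp = [0.300, 0.310, 0.320, 0.330, 0.340, 0.350]
runs = [650, 700, 720, 790, 800, 880]

print(f"r squared:      {pearson_r(obp, runs) ** 2:.4f}")
print(f"1 - SSres/SStot: {r_squared_from_fit(obp, runs):.4f}")
```

Both lines print the same value, which is why squaring the `r_value` returned by a linear regression gives R².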


**Batting Average**

**On-Base Percentage**

**Slugging**

**On-Base Plus Slugging**

## Conclusion

On-Base Percentage, Slugging, and On-Base Plus Slugging all correlate well with runs scored, each with an R² of roughly .9.

Batting average has the weakest correlation, with an R² of about .6.

| Metric | R² |
| --- | --- |
| Batting Average | .60 |
| On-Base Percentage | .88 |
| Slugging | .91 |
| On-Base Plus Slugging | .94 |

Here is the code in full if you want to copy and paste it. The team stats file is available to download as well.

    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    from scipy import stats

    df = pd.read_csv('TeamStats2019.csv')
    df2 = df[['Team','Year','BA','OBP','SLG','OPS','R']]
    df2 = df2.rename(columns = {"R":"Runs"})

    # Batting Average
    slope, intercept, r_value, p_value, std_err = stats.linregress(df2.BA, df2.Runs)
    plt.title('Runs Scored by Batting Average (BA)')
    sns.regplot(x=df2.BA, y=df2.Runs)
    plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
    plt.savefig('mlb-correlating-runs-scored-batting-average.png', dpi=300, bbox_inches='tight')
    plt.show()

    # On-Base Percentage
    slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OBP, df2.Runs)
    plt.title('Runs Scored by On-Base Percentage (OBP)')
    sns.regplot(x=df2.OBP, y=df2.Runs)
    plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
    plt.savefig('mlb-correlating-runs-scored-on-base-percentage.png', dpi=300, bbox_inches='tight')
    plt.show()

    # Slugging
    slope, intercept, r_value, p_value, std_err = stats.linregress(df2.SLG, df2.Runs)
    plt.title('Runs Scored by Slugging (SLG)')
    sns.regplot(x=df2.SLG, y=df2.Runs)
    plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
    plt.savefig('mlb-correlating-runs-scored-slugging.png', dpi=300, bbox_inches='tight')
    plt.show()

    # On-Base Plus Slugging
    slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OPS, df2.Runs)
    plt.title('Runs Scored by On-Base Plus Slugging (OPS)')
    sns.regplot(x=df2.OPS, y=df2.Runs)
    plt.text(0.1, 0.9, f'R-squared: {r_value ** 2:.2f}', transform=plt.gca().transAxes)
    plt.savefig('mlb-correlating-runs-scored-on-base-plus-slugging.png', dpi=300, bbox_inches='tight')
    plt.show()