For this analysis we will evaluate four metrics, 1) Batting Average, 2) On-Base Percentage, 3) Slugging, and 4) On-Base Plus Slugging, and determine which one correlates best with runs scored.

First, it is important to note how each metric is defined and calculated.

**Batting Average (BA)** is hits divided by at-bats. An average over .300 is considered superb, and the season’s batting average leader generally hits around .340. BA ranges from 0.0 to 1.0.

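**On-Base Percentage (OBP)** measures how often a batter reaches base. It is hits plus walks plus hit-by-pitches, divided by at-bats plus walks, hit-by-pitches, and sacrifice flies. Unlike batting average, it gives credit for walks and hit-by-pitches; reaching base on a fielding error does not count. OBP ranges from 0.0 to 1.0, and an OBP of .340 is considered good.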

**Slugging (SLG)** is similar to batting average, but it divides total bases by at-bats instead of hits by at-bats. A single counts as one base, a double as two, a triple as three, and a home run as four, so slugging gives more weight to extra-base hits than to singles. It is measured on a scale of 0 to 4. A slugging percentage of .450 is considered good, meaning the player averages almost half a base per at-bat.

**On-Base Plus Slugging (OPS)** is exactly what the name says: on-base percentage plus slugging. It is measured on a scale of 0 to 5 and represents a player’s ability to both get on base and hit for power. An OPS of .900 is considered great.
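
To make the four definitions concrete, here is a minimal sketch that computes them from raw counting stats using the standard formulas. The function name `batting_metrics` and the season line are my own illustration (the numbers are made up, not a real player), so treat this as an example rather than part of the analysis.

```python
def batting_metrics(ab, h, doubles, triples, hr, bb, hbp, sf):
    """Return BA, OBP, SLG and OPS from raw counting stats."""
    singles = h - doubles - triples - hr
    # Total bases: 1 per single, 2 per double, 3 per triple, 4 per home run
    total_bases = singles + 2 * doubles + 3 * triples + 4 * hr
    ba = h / ab
    # OBP counts walks and hit-by-pitches; sacrifice flies go in the denominator
    obp = (h + bb + hbp) / (ab + bb + hbp + sf)
    slg = total_bases / ab
    return {"BA": ba, "OBP": obp, "SLG": slg, "OPS": obp + slg}


# Made-up season line for illustration only (not a real player)
line = batting_metrics(ab=500, h=150, doubles=30, triples=5, hr=20,
                       bb=60, hbp=5, sf=5)
for name, value in line.items():
    print(f"{name}: {value:.3f}")
```

Note that BA and SLG share the same denominator (at-bats), while OBP uses a larger one, which is why OPS is a sum of two rates measured on slightly different bases.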

#### So which metric best correlates to runs scored?

Let’s find out!

For this analysis I downloaded the 2019 Major League Baseball team stats from Baseball-Reference.com.

I will use Python to import the data, plot it, and calculate Pearson’s correlation coefficient (r) along with its square, R². R² evaluates how well a model fits the data: it expresses what fraction of the variability of the dependent variable (Y, here runs scored) is explained by the independent variable (X, the batting metric). For a straight-line fit with a single predictor, R² is simply the square of Pearson’s r.
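
If it helps to see why squaring Pearson’s r gives the "fraction of variability explained", here is a small pure-Python sketch that computes both quantities by hand and shows they agree for a one-variable linear fit. The helper names and the five data points are made up for illustration; they are not the 2019 team stats.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_squared_from_fit(x, y):
    """R2 as 1 - SS_res / SS_tot for a least-squares straight line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only
x = [0.240, 0.250, 0.255, 0.262, 0.270]
y = [650, 700, 720, 790, 810]

print("r squared:        ", pearson_r(x, y) ** 2)
print("1 - SSres / SStot:", r_squared_from_fit(x, y))
```

The two printed values should match up to floating-point rounding, which is why the `r_value**2` shortcut works for a single-predictor regression.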


```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
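# Load the 2019 team stats downloaded from Baseball-Reference
df = pd.read_csv('TeamStats2019.csv')
# Keep only the columns needed and give runs a descriptive name
df2 = df[['Team','Year','BA','OBP','SLG','OPS','R']]
df2 = df2.rename(columns = {"R":"Runs"})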
```

**Batting Average**

```
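# Fit a simple linear regression of runs on batting average
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.BA,df2.Runs)
# Scatter plot with the fitted regression line
plt.title('Runs Scored by Batting Average (BA)')
sns.regplot(x = df2.BA, y = df2.Runs)
print('--------------------------------------------------')
# r_value is Pearson's r; squaring it gives R2
print('The value of R2 is:', r_value**2)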
```

**On-Base Percentage**

```
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OBP,df2.Runs)
plt.title('Runs Scored by On-Base Percentage (OBP)')
sns.regplot(x = df2.OBP, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
```

**Slugging**

```
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.SLG,df2.Runs)
plt.title('Runs Scored by Slugging (SLG)')
sns.regplot(x = df2.SLG, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
```

**On-Base Plus Slugging**

```
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OPS,df2.Runs)
plt.title('Runs Scored by On-Base Plus Slugging (OPS)')
sns.regplot(x = df2.OPS, y = df2.Runs)
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
```
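
The four blocks above repeat the same steps, so the R² values can also be collected in one pass. The sketch below uses pandas' `DataFrame.corrwith`, which computes Pearson's r between each column and a target series; squaring it gives the same R² as `r_value**2`. The `demo` frame holds made-up numbers purely so the snippet runs on its own. With the real data you would swap in `df2`.

```python
import pandas as pd

# Made-up numbers for illustration only; these are NOT the 2019 team stats
demo = pd.DataFrame({
    "BA":   [0.240, 0.250, 0.255, 0.262, 0.270],
    "OBP":  [0.305, 0.318, 0.322, 0.335, 0.341],
    "SLG":  [0.390, 0.410, 0.428, 0.445, 0.470],
    "Runs": [650, 700, 720, 790, 810],
})
demo["OPS"] = demo["OBP"] + demo["SLG"]

metrics = ["BA", "OBP", "SLG", "OPS"]
# Pearson's r of each metric against runs, squared, sorted best to worst
r2 = (demo[metrics].corrwith(demo["Runs"]) ** 2).sort_values(ascending=False)
print(r2.round(3))
```

This trades the per-metric plots for a compact ranking, so it works best as a summary alongside the charts rather than a replacement for them.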

## Conclusion

On-Base Percentage, Slugging, and On-Base Plus Slugging all correlate well with runs scored, with R² values between .88 and .94. Batting average correlates the least, with an R² of .60.

| Metric | R² |
| --- | --- |
| Batting Average | .597 |
| On-Base Percentage | .878 |
| Slugging | .914 |
| On-Base Plus Slugging | .937 |

Here is the code in full if you want to copy and paste.

```
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy import stats
df = pd.read_csv('TeamStats2019.csv')
df2 = df[['Team','Year','BA','OBP','SLG','OPS','R']]
df2 = df2.rename(columns = {"R":"Runs"})
#Batting Average
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.BA,df2.Runs)
plt.title('Runs Scored by Batting Average (BA)')
sns.regplot(x = df2.BA, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
#On-Base Percentage
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OBP,df2.Runs)
plt.title('Runs Scored by On-Base Percentage (OBP)')
sns.regplot(x = df2.OBP, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
#Slugging
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.SLG,df2.Runs)
plt.title('Runs Scored by Slugging (SLG)')
sns.regplot(x = df2.SLG, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
#On-Base Plus Slugging
slope, intercept, r_value, p_value, std_err = stats.linregress(df2.OPS,df2.Runs)
plt.title('Runs Scored by On-Base Plus Slugging (OPS)')
sns.regplot(x = df2.OPS, y = df2.Runs)
plt.show()
print('--------------------------------------------------')
print('The value of R2 is:', r_value**2)
```