The following dataset was excerpted from the DataCamp course “Statistical Thinking in Python”. I’ve added my own analysis and have hopefully clarified a few things along the way.

In this analysis we will look at the correlation between female illiteracy and fertility. Fertility is defined as the average number of children born per woman.

First, let’s import a few modules, define a pearson_r function, create the data arrays, do some quick EDA, and compute the Pearson correlation coefficient.

```
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)
    # Return entry [0,1]
    return corr_mat[0, 1]
illiteracy = np.array([ 9.5, 49.2, 1. , 11.2, 9.8, 60. , 50.2, 51.2, 0.6, 1. , 8.5,
6.1, 9.8, 1. , 42.2, 77.2, 18.7, 22.8, 8.5, 43.9, 1. , 1. ,
1.5, 10.8, 11.9, 3.4, 0.4, 3.1, 6.6, 33.7, 40.4, 2.3, 17.2,
0.7, 36.1, 1. , 33.2, 55.9, 30.8, 87.4, 15.4, 54.6, 5.1, 1.1,
10.2, 19.8, 0. , 40.7, 57.2, 59.9, 3.1, 55.7, 22.8, 10.9, 34.7,
32.2, 43. , 1.3, 1. , 0.5, 78.4, 34.2, 84.9, 29.1, 31.3, 18.3,
81.8, 39. , 11.2, 67. , 4.1, 0.2, 78.1, 1. , 7.1, 1. , 29. ,
1.1, 11.7, 73.6, 33.9, 14. , 0.3, 1. , 0.8, 71.9, 40.1, 1. ,
2.1, 3.8, 16.5, 4.1, 0.5, 44.4, 46.3, 18.7, 6.5, 36.8, 18.6,
11.1, 22.1, 71.1, 1. , 0. , 0.9, 0.7, 45.5, 8.4, 0. , 3.8,
8.5, 2. , 1. , 58.9, 0.3, 1. , 14. , 47. , 4.1, 2.2, 7.2,
0.3, 1.5, 50.5, 1.3, 0.6, 19.1, 6.9, 9.2, 2.2, 0.2, 12.3,
4.9, 4.6, 0.3, 16.5, 65.7, 63.5, 16.8, 0.2, 1.8, 9.6, 15.2,
14.4, 3.3, 10.6, 61.3, 10.9, 32.2, 9.3, 11.6, 20.7, 6.5, 6.7,
3.5, 1. , 1.6, 20.5, 1.5, 16.7, 2. , 0.9])
fertility = np.array([1.769, 2.682, 2.077, 2.132, 1.827, 3.872, 2.288, 5.173, 1.393,
1.262, 2.156, 3.026, 2.033, 1.324, 2.816, 5.211, 2.1 , 1.781,
1.822, 5.908, 1.881, 1.852, 1.39 , 2.281, 2.505, 1.224, 1.361,
1.468, 2.404, 5.52 , 4.058, 2.223, 4.859, 1.267, 2.342, 1.579,
6.254, 2.334, 3.961, 6.505, 2.53 , 2.823, 2.498, 2.248, 2.508,
3.04 , 1.854, 4.22 , 5.1 , 4.967, 1.325, 4.514, 3.173, 2.308,
4.62 , 4.541, 5.637, 1.926, 1.747, 2.294, 5.841, 5.455, 7.069,
2.859, 4.018, 2.513, 5.405, 5.737, 3.363, 4.89 , 1.385, 1.505,
6.081, 1.784, 1.378, 1.45 , 1.841, 1.37 , 2.612, 5.329, 5.33 ,
3.371, 1.281, 1.871, 2.153, 5.378, 4.45 , 1.46 , 1.436, 1.612,
3.19 , 2.752, 3.35 , 4.01 , 4.166, 2.642, 2.977, 3.415, 2.295,
3.019, 2.683, 5.165, 1.849, 1.836, 2.518, 2.43 , 4.528, 1.263,
1.885, 1.943, 1.899, 1.442, 1.953, 4.697, 1.582, 2.025, 1.841,
5.011, 1.212, 1.502, 2.516, 1.367, 2.089, 4.388, 1.854, 1.748,
2.978, 2.152, 2.362, 1.988, 1.426, 3.29 , 3.264, 1.436, 1.393,
2.822, 4.969, 5.659, 3.24 , 1.693, 1.647, 2.36 , 1.792, 3.45 ,
1.516, 2.233, 2.563, 5.283, 3.885, 0.966, 2.373, 2.663, 1.251,
2.052, 3.371, 2.093, 2. , 3.883, 3.852, 3.718, 1.732, 3.928])
```

**Exploratory Data Analysis**

Now let’s create a quick scatter plot and visualize the data.

```
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
# Set the margins and label axes
plt.margins(.02)
_ = plt.xlabel('Percent illiterate')
_ = plt.ylabel('Fertility')
# Show the plot
plt.show()
```

Looking at the scatter plot, it appears there is a positive correlation between illiteracy and fertility. Let’s dive deeper into this.

Pearson’s correlation coefficient is the test statistic that measures the strength of the linear relationship, or association, between two continuous variables.
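As a sanity check on the pearson_r function above, here is a minimal sketch (using small made-up arrays, not the course data) showing that computing Pearson’s r directly from its definition, covariance divided by the product of the standard deviations, agrees with np.corrcoef:

```python
import numpy as np

def pearson_r_manual(x, y):
    """Pearson r from its definition: covariance over product of std devs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Small illustrative arrays (not the course data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

print(pearson_r_manual(x, y))   # close to 1.0 for this near-linear data
print(np.corrcoef(x, y)[0, 1])  # matches NumPy's built-in
```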

Here we calculate the correlation coefficient and create a scatter plot with a line of best fit.

To do this, we first create a simple scatter plot, then determine the slope and intercept using np.polyfit, and finally overlay the resulting regression line on the scatter plot.

```
# Show the Pearson correlation coefficient
print('The Pearson correlation coefficient is ' + str(pearson_r(illiteracy, fertility)))
# Plot the illiteracy rate versus fertility
plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
# Perform a linear regression using np.polyfit(): a, b
slope, intercept = np.polyfit(illiteracy, fertility, 1)
# Print the results to the screen
print('slope =', slope, 'children per woman / percent illiterate')
print('intercept =', intercept, 'children per woman')
# Make theoretical line to plot
x = np.array([0,100])
y = slope * x + intercept
# Add regression line to your plot
_ = plt.plot(x, y)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Draw the plot
plt.show()
```
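For reference, a degree-1 np.polyfit is ordinary least squares, so the slope and intercept also have a closed form (slope = cov(x, y) / var(x)). A quick sketch on synthetic data (not the course arrays) confirms the two agree:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 100, size=50)
y = 0.05 * x + 1.9 + rng.normal(0, 0.3, size=50)  # synthetic, not the course data

# Closed-form least-squares slope and intercept
slope_cf = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
intercept_cf = y.mean() - slope_cf * x.mean()

# Same fit via np.polyfit (degree 1)
slope_pf, intercept_pf = np.polyfit(x, y, 1)
print(np.allclose([slope_cf, intercept_cf], [slope_pf, intercept_pf]))  # True
```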

**Bootstrap Replicates**

Next we will generate pairs bootstrap replicates of the illiteracy/fertility data and compute a 95% confidence interval for the slope.

We also create a histogram of the bootstrap slopes to help validate our function.

```
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))
    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)
    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)
    return bs_slope_reps, bs_intercept_reps
# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, size=10_000)
# Compute and print 95% CI for slope
print('The 95% confidence interval for the slope is ' + str(np.percentile(bs_slope_reps, [2.5,97.5])))
# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50)
_ = plt.xlabel('Slope')
_ = plt.ylabel('Count')
plt.show()
```
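The defining property of a *pairs* bootstrap is that each resampled y stays matched to its x, so the pairing (and hence the relationship being estimated) is preserved. A tiny sketch with made-up data makes this explicit:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.array([10.0, 20.0, 30.0, 40.0])
y = np.array([1.0, 2.0, 3.0, 4.0])  # y = x / 10, so pairs must stay matched

# Resample indices with replacement, then index both arrays with them
inds = rng.choice(np.arange(len(x)), size=len(x))
bs_x, bs_y = x[inds], y[inds]

# Every resampled y is still exactly its x / 10
print(np.allclose(bs_y, bs_x / 10))  # True
```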

Now that we have two arrays of bootstrap samples, bs_slope_reps for the slope and bs_intercept_reps for the intercept, we can visualize the confidence interval of the slope.

To keep the plot readable and fast to draw, we take the first 100 replicates and plot them over a scatter plot of the data.

```
# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])
# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i] * x + bs_intercept_reps[i], linewidth=0.5, alpha=0.2, color='red')
# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.',linestyle='none')
# Label axes, set the margins, and show the plot
_ = plt.xlabel('Illiteracy')
_ = plt.ylabel('Fertility')
plt.margins(0.02)
plt.show()
```

It should be noted that the above can be done much more easily with Seaborn: by default, Seaborn’s regplot draws the regression line along with its confidence interval as a translucent band.

```
ax = sns.regplot(x=illiteracy, y=fertility, marker='.')
ax.set(xlabel='Percent illiterate', ylabel='Fertility')
plt.show()
```

The observed correlation between female illiteracy and fertility might just be due to chance. Next we will test the null hypothesis that the two variables are unrelated, i.e. that fertility is independent of illiteracy.

We will accomplish this by permuting the illiteracy values while leaving the fertility values fixed. For each permutation, we compute the Pearson correlation coefficient and count how many of the replicates have a coefficient at least as large as the observed one.

```
# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy,fertility)
print('Our observed Pearson R is ' + str(r_obs))
# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10_000)
# Draw replicates
for i in range(10_000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)
    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)
# Plot the distribution of permutation replicates
_ = plt.hist(perm_replicates, bins=100)
# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)
```
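One caveat worth keeping in mind: a permutation p-value of exactly zero only means that none of the replicates reached the observed statistic, so the test’s resolution is limited to 1 / (number of replicates). A toy sketch with hypothetical replicate values (not the real permutation results):

```python
import numpy as np

# Hypothetical permutation replicates and an observed statistic
perm_replicates = np.array([0.05, -0.12, 0.30, 0.18, -0.22])
r_obs = 0.80

# Empirical p-value: fraction of replicates at least as extreme as observed
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print(p)  # 0.0 here, best reported as p < 1/len(perm_replicates)
```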

Our p-value of zero means that none of the 10,000 permutation replicates produced a correlation as large as the observed one (i.e. p < 0.0001), so we reject the null hypothesis that female illiteracy and fertility are unrelated.

**Conclusion**

In this analysis we conclude that fertility and illiteracy are correlated.

We computed a Pearson correlation coefficient of 0.80, indicating a strong positive correlation between illiteracy and fertility.

By using pairs bootstrap sampling, we determined a confidence interval for the slope and plotted it using both Matplotlib and Seaborn.

Finally, we ran a permutation hypothesis test and found that none of the 10,000 permuted datasets reproduced our observed correlation of 0.80. We reject the null hypothesis that the two variables are unrelated.