The following dataset was excerpted from the DataCamp course “Statistical Thinking in Python”. I’ve added my own analysis and have hopefully clarified a few things along the way.

In this analysis we will look at the correlation between female literacy and fertility. Fertility is defined as the average number of children born per woman.

First let’s import a few modules, create the function pearson_r, create the arrays, do some quick EDA, and then compute the Pearson correlation coefficient.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)
    # Return entry [0,1]
    return corr_mat[0, 1]


illiteracy = np.array([
     9.5, 49.2,  1.0, 11.2,  9.8, 60.0, 50.2, 51.2,  0.6,  1.0,  8.5,  6.1,
     9.8,  1.0, 42.2, 77.2, 18.7, 22.8,  8.5, 43.9,  1.0,  1.0,  1.5, 10.8,
    11.9,  3.4,  0.4,  3.1,  6.6, 33.7, 40.4,  2.3, 17.2,  0.7, 36.1,  1.0,
    33.2, 55.9, 30.8, 87.4, 15.4, 54.6,  5.1,  1.1, 10.2, 19.8,  0.0, 40.7,
    57.2, 59.9,  3.1, 55.7, 22.8, 10.9, 34.7, 32.2, 43.0,  1.3,  1.0,  0.5,
    78.4, 34.2, 84.9, 29.1, 31.3, 18.3, 81.8, 39.0, 11.2, 67.0,  4.1,  0.2,
    78.1,  1.0,  7.1,  1.0, 29.0,  1.1, 11.7, 73.6, 33.9, 14.0,  0.3,  1.0,
     0.8, 71.9, 40.1,  1.0,  2.1,  3.8, 16.5,  4.1,  0.5, 44.4, 46.3, 18.7,
     6.5, 36.8, 18.6, 11.1, 22.1, 71.1,  1.0,  0.0,  0.9,  0.7, 45.5,  8.4,
     0.0,  3.8,  8.5,  2.0,  1.0, 58.9,  0.3,  1.0, 14.0, 47.0,  4.1,  2.2,
     7.2,  0.3,  1.5, 50.5,  1.3,  0.6, 19.1,  6.9,  9.2,  2.2,  0.2, 12.3,
     4.9,  4.6,  0.3, 16.5, 65.7, 63.5, 16.8,  0.2,  1.8,  9.6, 15.2, 14.4,
     3.3, 10.6, 61.3, 10.9, 32.2,  9.3, 11.6, 20.7,  6.5,  6.7,  3.5,  1.0,
     1.6, 20.5,  1.5, 16.7,  2.0,  0.9])

fertility = np.array([
    1.769, 2.682, 2.077, 2.132, 1.827, 3.872, 2.288, 5.173, 1.393, 1.262, 2.156, 3.026,
    2.033, 1.324, 2.816, 5.211, 2.1,   1.781, 1.822, 5.908, 1.881, 1.852, 1.39,  2.281,
    2.505, 1.224, 1.361, 1.468, 2.404, 5.52,  4.058, 2.223, 4.859, 1.267, 2.342, 1.579,
    6.254, 2.334, 3.961, 6.505, 2.53,  2.823, 2.498, 2.248, 2.508, 3.04,  1.854, 4.22,
    5.1,   4.967, 1.325, 4.514, 3.173, 2.308, 4.62,  4.541, 5.637, 1.926, 1.747, 2.294,
    5.841, 5.455, 7.069, 2.859, 4.018, 2.513, 5.405, 5.737, 3.363, 4.89,  1.385, 1.505,
    6.081, 1.784, 1.378, 1.45,  1.841, 1.37,  2.612, 5.329, 5.33,  3.371, 1.281, 1.871,
    2.153, 5.378, 4.45,  1.46,  1.436, 1.612, 3.19,  2.752, 3.35,  4.01,  4.166, 2.642,
    2.977, 3.415, 2.295, 3.019, 2.683, 5.165, 1.849, 1.836, 2.518, 2.43,  4.528, 1.263,
    1.885, 1.943, 1.899, 1.442, 1.953, 4.697, 1.582, 2.025, 1.841, 5.011, 1.212, 1.502,
    2.516, 1.367, 2.089, 4.388, 1.854, 1.748, 2.978, 2.152, 2.362, 1.988, 1.426, 3.29,
    3.264, 1.436, 1.393, 2.822, 4.969, 5.659, 3.24,  1.693, 1.647, 2.36,  1.792, 3.45,
    1.516, 2.233, 2.563, 5.283, 3.885, 0.966, 2.373, 2.663, 1.251, 2.052, 3.371, 2.093,
    2.0,   3.883, 3.852, 3.718, 1.732, 3.928])
```

**Exploratory Data Analysis**

Now let’s create a quick scatter plot and visualize the data.

```python
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Set the margins and label the axes
plt.margins(0.02)
_ = plt.xlabel('Percent Illiterate')
_ = plt.ylabel('Fertility')

# Add a title to the plot
_ = plt.title('Scatter Plot of Illiteracy Rate vs Fertility')

# Save and show the plot
plt.savefig('literacy-vs-Fertility-scatter-plot.png', dpi=300, bbox_inches='tight')
plt.show()
```

Looking at the scatter plot it appears there is a correlation between illiteracy and fertility. Let’s dive deeper into this.

Pearson’s correlation coefficient is the test statistic that measures the strength of the linear relationship, or association, between two continuous variables.
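As a sanity check on `pearson_r`, the coefficient can also be computed directly from its definition: the covariance of the two arrays divided by the product of their standard deviations. The sketch below uses small toy arrays of my own (not the course data) to confirm the two computations agree:

```python
import numpy as np


def pearson_manual(x, y):
    """Pearson r from its definition: cov(x, y) / (std(x) * std(y))."""
    x_dev = x - np.mean(x)
    y_dev = y - np.mean(y)
    return np.sum(x_dev * y_dev) / np.sqrt(np.sum(x_dev**2) * np.sum(y_dev**2))


# Toy data: y increases roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

# Both approaches give the same coefficient, close to 1 (strong positive association)
print(pearson_manual(x, y))
print(np.corrcoef(x, y)[0, 1])
```

Because the deviations are normalized by both standard deviations, the result is unitless and always lies between -1 and 1.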

Here we calculate the correlation coefficient and create a scatter plot with a line of best fit.

To do this we build the plot in two steps. First we create a simple scatter plot; then we determine the slope and intercept using np.polyfit and overlay the resulting regression line on the scatter plot.

```python
# Show the Pearson correlation coefficient
print('The Pearson correlation coefficient is ' + str(pearson_r(illiteracy, fertility)))

# Plot the illiteracy rate versus fertility
plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)

# Perform a linear regression using np.polyfit(): slope, intercept
slope, intercept = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', slope, 'children per woman / percent illiterate')
print('intercept =', intercept, 'children per woman')

# Make a theoretical line to plot
x = np.array([0, 100])
y = slope * x + intercept

# Add the regression line to the plot
_ = plt.plot(x, y)

# Set the title and label the axes
_ = plt.title('Scatter Plot and Linear Regression of Illiteracy Rate vs Fertility')
_ = plt.xlabel('Percent Illiterate')
_ = plt.ylabel('Fertility')

# Save and show the plot
plt.savefig('literacy-vs-fertility-scatter-plot-with-slope.png', dpi=300, bbox_inches='tight')
plt.show()
```

```
The Pearson correlation coefficient is 0.8041324026815341
slope = 0.04979854809063423 children per woman / percent illiterate
intercept = 1.888050610636557 children per woman
```

**Bootstrap Replicates**

Next we will generate pairs bootstrap replicates of the literacy/fertility data and compute the 95% confidence interval of the slope.

We also create a histogram of the bootstrapped slopes to sanity-check the results.

```python
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps


# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, size=10_000)

# Compute and print the 95% CI for the slope
print('The 95% confidence interval for the slope is '
      + str(np.percentile(bs_slope_reps, [2.5, 97.5])))

# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50)
_ = plt.title('Confidence Interval of the Slope')
_ = plt.xlabel('Slope')
_ = plt.ylabel('Count')
plt.savefig('literacy-vs-fertility-histogram.png', dpi=300, bbox_inches='tight')
plt.show()
```

`The 95% confidence interval for the slope is [0.04412229 0.05554294]`

Now that we have two arrays of bootstrap samples, bs_slope_reps for the slope and bs_intercept_reps for the intercept, we can visualize the confidence interval of the slope.

For performance reasons, we take only the first 100 replicates and overlay their regression lines on a scatter plot of the data.

```python
# Generate array of x-values for bootstrap lines: x
x = np.array([0, 100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i] * x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Label the axes, set the margins, and save and show the plot
_ = plt.title('Confidence Interval of the Slope Visualized')
_ = plt.xlabel('Illiteracy')
_ = plt.ylabel('Fertility')
plt.margins(0.02)
plt.savefig('literacy-vs-fertility-ci-slope.png', dpi=300, bbox_inches='tight')
plt.show()
```

Note that the above can be done much more easily with Seaborn: by default, regplot draws the regression line along with a translucent band showing the confidence interval.

```python
ax = sns.regplot(x=illiteracy, y=fertility, marker='.')
ax.set(xlabel='Illiteracy', ylabel='Fertility',
       title='Seaborn Confidence Interval of the Slope Visualized')
plt.savefig('literacy-vs-fertility-seaborn-ci-slope.png', dpi=300, bbox_inches='tight')
plt.show()
```

The observed correlation between female illiteracy and fertility might still have arisen by chance. Next we test the null hypothesis that the two variables are unrelated, i.e., that fertility is independent of illiteracy.

We do this by permuting the illiteracy values while leaving the fertility values fixed. For each permutation, we compute the Pearson correlation coefficient and count how many of the replicates have a coefficient at least as large as the observed one.

```python
# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)
print('Our observed Pearson R is ' + str(r_obs))

# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10_000)

# Draw replicates
for i in range(10_000):
    # Permute the illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)
    # Compute the Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

# Plot a histogram of the replicates
_ = plt.title('Pearson Correlation Coefficient Permutation Replicates')
_ = plt.xlabel('Pearson Correlation Coefficient')
_ = plt.ylabel('Count')
plt.hist(perm_replicates, bins=100)
plt.savefig('literacy-vs-fertility-pearson-correlation-histogram.png', dpi=300, bbox_inches='tight')

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)

```
Our observed Pearson R is 0.8041324026815341
p-val = 0.0
```

A p-value of exactly zero really means that none of the 10,000 permutation replicates produced a correlation as large as the observed one, so p < 1/10,000. At any reasonable significance level, we reject the null hypothesis that female literacy and fertility are unrelated.
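Because a reported p-value of 0.0 only says "zero replicates out of 10,000 reached the observed statistic", a common convention is to report the add-one estimate (count + 1) / (n + 1), which can never be exactly zero. Here is a minimal sketch of that idea; the `perm_replicates` and `r_obs` below are synthetic stand-ins (drawn with a fixed seed), not the values computed above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in replicates centered near zero, as a null distribution would be;
# none of them come anywhere near the stand-in observed statistic of 0.80
perm_replicates = rng.normal(0.0, 0.1, size=10_000)
r_obs = 0.80

count = np.sum(perm_replicates >= r_obs)
p_naive = count / len(perm_replicates)                # can be exactly 0.0
p_add_one = (count + 1) / (len(perm_replicates) + 1)  # conservative, never 0

print(p_naive, p_add_one)
```

The add-one version treats the observed sample itself as one of the permutations, which is why its smallest possible value is 1/(n + 1) rather than zero.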

**Conclusion**

In this analysis we conclude that fertility and illiteracy are positively correlated.

We computed a Pearson correlation coefficient of .80, which indicates a strong positive correlation between illiteracy and fertility.

By using pairs bootstrap sampling, we determined a 95% confidence interval for the slope and visualized it using both Matplotlib and Seaborn.

Finally, we ran a permutation hypothesis test and found that none of 10,000 shuffled samples came close to reproducing our observed value of .80. We therefore reject the null hypothesis that the two variables are unrelated.