# Literacy vs Fertility

The following dataset was excerpted from the Data Camp course “Statistical Thinking in Python.” I’ve added my own analysis and have hopefully clarified a few things along the way.

In this analysis, we will look at the correlation between female literacy and fertility. Fertility is defined as the average number of children born per woman.

First, let’s import a few modules, create the function pearson_r, create the arrays, do some quick EDA, and then compute the Pearson correlation coefficient.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def pearson_r(x, y):
    """Compute the Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)

    # Return entry [0,1]
    return corr_mat[0, 1]

illiteracy = np.array([ 9.5, 49.2,  1. , 11.2,  9.8, 60. , 50.2, 51.2,  0.6,  1. ,  8.5,
6.1,  9.8,  1. , 42.2, 77.2, 18.7, 22.8,  8.5, 43.9,  1. ,  1. ,
1.5, 10.8, 11.9,  3.4,  0.4,  3.1,  6.6, 33.7, 40.4,  2.3, 17.2,
0.7, 36.1,  1. , 33.2, 55.9, 30.8, 87.4, 15.4, 54.6,  5.1,  1.1,
10.2, 19.8,  0. , 40.7, 57.2, 59.9,  3.1, 55.7, 22.8, 10.9, 34.7,
32.2, 43. ,  1.3,  1. ,  0.5, 78.4, 34.2, 84.9, 29.1, 31.3, 18.3,
81.8, 39. , 11.2, 67. ,  4.1,  0.2, 78.1,  1. ,  7.1,  1. , 29. ,
1.1, 11.7, 73.6, 33.9, 14. ,  0.3,  1. ,  0.8, 71.9, 40.1,  1. ,
2.1,  3.8, 16.5,  4.1,  0.5, 44.4, 46.3, 18.7,  6.5, 36.8, 18.6,
11.1, 22.1, 71.1,  1. ,  0. ,  0.9,  0.7, 45.5,  8.4,  0. ,  3.8,
8.5,  2. ,  1. , 58.9,  0.3,  1. , 14. , 47. ,  4.1,  2.2,  7.2,
0.3,  1.5, 50.5,  1.3,  0.6, 19.1,  6.9,  9.2,  2.2,  0.2, 12.3,
4.9,  4.6,  0.3, 16.5, 65.7, 63.5, 16.8,  0.2,  1.8,  9.6, 15.2,
14.4,  3.3, 10.6, 61.3, 10.9, 32.2,  9.3, 11.6, 20.7,  6.5,  6.7,
3.5,  1. ,  1.6, 20.5,  1.5, 16.7,  2. ,  0.9])

fertility = np.array([1.769, 2.682, 2.077, 2.132, 1.827, 3.872, 2.288, 5.173, 1.393,
1.262, 2.156, 3.026, 2.033, 1.324, 2.816, 5.211, 2.1  , 1.781,
1.822, 5.908, 1.881, 1.852, 1.39 , 2.281, 2.505, 1.224, 1.361,
1.468, 2.404, 5.52 , 4.058, 2.223, 4.859, 1.267, 2.342, 1.579,
6.254, 2.334, 3.961, 6.505, 2.53 , 2.823, 2.498, 2.248, 2.508,
3.04 , 1.854, 4.22 , 5.1  , 4.967, 1.325, 4.514, 3.173, 2.308,
4.62 , 4.541, 5.637, 1.926, 1.747, 2.294, 5.841, 5.455, 7.069,
2.859, 4.018, 2.513, 5.405, 5.737, 3.363, 4.89 , 1.385, 1.505,
6.081, 1.784, 1.378, 1.45 , 1.841, 1.37 , 2.612, 5.329, 5.33 ,
3.371, 1.281, 1.871, 2.153, 5.378, 4.45 , 1.46 , 1.436, 1.612,
3.19 , 2.752, 3.35 , 4.01 , 4.166, 2.642, 2.977, 3.415, 2.295,
3.019, 2.683, 5.165, 1.849, 1.836, 2.518, 2.43 , 4.528, 1.263,
1.885, 1.943, 1.899, 1.442, 1.953, 4.697, 1.582, 2.025, 1.841,
5.011, 1.212, 1.502, 2.516, 1.367, 2.089, 4.388, 1.854, 1.748,
2.978, 2.152, 2.362, 1.988, 1.426, 3.29 , 3.264, 1.436, 1.393,
2.822, 4.969, 5.659, 3.24 , 1.693, 1.647, 2.36 , 1.792, 3.45 ,
1.516, 2.233, 2.563, 5.283, 3.885, 0.966, 2.373, 2.663, 1.251,
2.052, 3.371, 2.093, 2.   , 3.883, 3.852, 3.718, 1.732, 3.928])
```

## Exploratory Data Analysis

Now, let’s create a quick scatter plot and visualize the data.

```python
import matplotlib.pyplot as plt

# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')

# Set the margins and label axes
plt.margins(0.02)
_ = plt.xlabel('Percent Illiterate')
_ = plt.ylabel('Fertility')

# Add title to the plot
_ = plt.title('Scatter Plot of Illiteracy Rate vs Fertility')

# Save and show the plot
plt.savefig('literacy-vs-Fertility-scatter-plot.png', dpi=300, bbox_inches='tight')
plt.show()
```

Looking at the scatter plot, it appears there is a correlation between illiteracy and fertility. Let’s dive deeper into this.

Pearson’s correlation coefficient measures the strength and direction of the linear relationship between two continuous variables. It ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship.
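As a quick sanity check on `pearson_r`, the coefficient can also be computed directly from its definition: the covariance of the two arrays divided by the product of their standard deviations. The helper name `pearson_r_manual` below is my own; this is a minimal sketch, not part of the original course code.

```python
import numpy as np

def pearson_r_manual(x, y):
    """Pearson's r from its definition: cov(x, y) / (std(x) * std(y))."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance (ddof=0), matching np.std's default
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

# Agrees with np.corrcoef on any paired data, e.g.:
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.9, 3.5, 4.2])
print(pearson_r_manual(x, y))  # same value as np.corrcoef(x, y)[0, 1]
```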

Here, we calculate the correlation coefficient and create a scatter plot with a line of best fit.

To do this, we first create a simple scatter plot, then determine the slope and intercept using np.polyfit and overlay the fitted line on the scatter plot.

```python
import matplotlib.pyplot as plt
import numpy as np

# Show the Pearson correlation coefficient
print('The Pearson correlation coefficient is ' + str(pearson_r(illiteracy, fertility)))

# Plot the illiteracy rate versus fertility
plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)

# Perform a linear regression using np.polyfit(): a, b
slope, intercept = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', slope, 'children per woman / percent illiterate')
print('intercept =', intercept, 'children per woman')

# Make theoretical line to plot
x = np.array([0,100])
y = slope * x + intercept

_ = plt.plot(x, y)

# Set the title and label axes
_ = plt.title('Scatter Plot and Linear Regression of Illiteracy Rate vs Fertility')
_ = plt.xlabel('Percent Illiterate')
_ = plt.ylabel('Fertility')

# Save and show the plot
plt.savefig('literacy-vs-fertility-scatter-plot-with-slope.png', dpi=300, bbox_inches='tight')
plt.show()
```

```
The Pearson correlation coefficient is 0.8041324026815341
slope = 0.04979854809063423 children per woman / percent illiterate
intercept = 1.888050610636557 children per woman
```

## Bootstrap Replicates

Next, we will draw pairs bootstrap replicates from the illiteracy/fertility data and compute the 95% confidence interval of the slope.

We also create a histogram of the slope from the bootstrap samples to aid in validating our function.

```python
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps

# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, size=10_000)

# Compute and print 95% CI for slope
print('The 95% confidence interval for the slope is ' + str(np.percentile(bs_slope_reps, [2.5,97.5])))

# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50)
_ = plt.title('Confidence Interval of the Slope')
_ = plt.xlabel('Slope')
_ = plt.ylabel('Count')

plt.savefig('literacy-vs-fertility-histogram.png', dpi=300, bbox_inches='tight')
plt.show()
```

```
The 95% confidence interval for the slope is [0.04412229 0.05554294]
```
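As a cross-check on the bootstrap percentile interval, a classical normal-approximation confidence interval can be computed from the slope's standard error, which `scipy.stats.linregress` reports. This is an addition of mine, assuming SciPy is available; the synthetic data below merely stands in for the illiteracy/fertility arrays.

```python
import numpy as np
from scipy import stats

# Synthetic paired data standing in for illiteracy/fertility
rng = np.random.default_rng(0)
x = rng.uniform(0, 90, size=160)
y = 1.9 + 0.05 * x + rng.normal(0, 0.8, size=160)

res = stats.linregress(x, y)

# Normal-approximation 95% CI: slope +/- 1.96 * standard error
ci_low = res.slope - 1.96 * res.stderr
ci_high = res.slope + 1.96 * res.stderr
print(f'slope = {res.slope:.4f}, 95% CI ~ [{ci_low:.4f}, {ci_high:.4f}]')
```

On the real data this interval should roughly agree with the bootstrap percentile interval, since the residuals are reasonably well behaved.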

Now that we have two arrays of bootstrap samples, bs_slope_reps for the slope and bs_intercept_reps for the intercept, we can visualize the confidence interval of the slope.

To keep the plot readable and fast to render, we plot only the first 100 bootstrap lines over the scatter plot.

```python
# Generate array of x-values for bootstrap lines: x
x = np.array([0,100])

# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x, bs_slope_reps[i] * x + bs_intercept_reps[i], linewidth=0.5, alpha=0.2, color='red')

# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.',linestyle='none')

# Label axes, set the margins, and show the plot
_ = plt.title('Confidence Interval of the Slope Visualized')
_ = plt.xlabel('Illiteracy')
_ = plt.ylabel('Fertility')
plt.margins(0.02)
plt.savefig('literacy-vs-fertility-ci-slope.png', dpi=300, bbox_inches='tight')
plt.show()
```

It should be noted that the above can be done much more easily with Seaborn: by default, regplot draws the 95% confidence interval of the regression line as a translucent band.

```python
ax = sns.regplot(x=illiteracy, y=fertility, marker='.')
ax.set(xlabel='Illiteracy', ylabel='Fertility', title='Seaborn Confidence Interval of the Slope Visualized')
plt.savefig('literacy-vs-fertility-seaborn-ci-slope.png', dpi=300, bbox_inches='tight')
plt.show()
```

The observed correlation between female illiteracy and fertility may have arisen by chance. Next, we will test the null hypothesis that the two variables are unrelated, i.e., that fertility is independent of illiteracy.

We will accomplish this by permuting the illiteracy values while leaving the fertility values fixed, computing the Pearson correlation coefficient for each permutation, and counting how many replicates have a coefficient at least as large as the observed one.

```python
# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy,fertility)
print('Our observed Pearson R is ' + str(r_obs))
# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10_000)

# Draw replicates
for i in range(10_000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)

    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

_ = plt.title('Pearson Correlation Coefficient Permutation Replicates')
_ = plt.xlabel('Pearson Correlation Coefficient')
_ = plt.ylabel('Count')
plt.hist(perm_replicates, bins=100)
plt.savefig('literacy-vs-fertility-pearson-correlation-histogram.png', dpi=300, bbox_inches='tight')

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs) / len(perm_replicates)
print('p-val =', p)
```

```
Our observed Pearson R is 0.8041324026815341
p-val = 0.0
```

A p-value of 0.0 means that none of the 10,000 permutation replicates produced a correlation as large as the observed one, i.e., p < 0.0001. This is strong evidence against the null hypothesis, so we reject it and conclude that female illiteracy and fertility are correlated.
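A reported p-value of exactly zero is really an artifact of finite sampling: with 10,000 permutations, the smallest effect we can resolve is about 1/10,000. A common, slightly conservative correction (my addition here, not part of the original course) counts the observed statistic as one extra "permutation":

```python
def permutation_pvalue(n_exceed, n_reps):
    """Conservative permutation p-value: (k + 1) / (N + 1).

    Never returns exactly zero, since the observed statistic itself
    is counted as one permutation under the null.
    """
    return (n_exceed + 1) / (n_reps + 1)

# If 0 of 10,000 replicates meet or exceed the observed r:
print(permutation_pvalue(0, 10_000))  # just under 1e-4, rather than 0.0
```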

## Conclusion

In this analysis, we conclude that fertility and illiteracy are correlated.

We computed a Pearson correlation coefficient of 0.80, indicating a strong positive correlation between illiteracy and fertility.

Using pairs bootstrap sampling, we determined a 95% confidence interval for the slope and visualized it with both Matplotlib and Seaborn.

Finally, we ran a permutation-based hypothesis test and found that none of the 10,000 shuffled samples reproduced a correlation as large as the observed 0.80. We therefore reject the null hypothesis that the two variables are unrelated.