The following is an analysis of the average rainfall in my home city of Austin, TX.

First, let’s import our modules and data.

import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns df = pd.read_csv('austin_texas_rainfall.csv',sep='|') rainfall = np.array(df['Annual'])

Let’s look at Austin’s rainfall year over year.

Looking at the graph, there doesn’t appear to be a trend from one year to the next.

plt.figure(figsize=(16, 4)) plt.bar(df['Year'],df['Annual']) plt.xlabel("Year") plt.ylabel("Rainfall (inches)") plt.title("Austin Rainfall by Year") plt.savefig('austin-rainfall-barchart.png', dpi=300, bbox_inches='tight') plt.show()

Now let’s calculate a few summary statistics and look at a histogram to see if the average rainfall follows a normal distribution.

Looking at the graph it’s hard to tell; with only 120 years worth of data points there may be binning bias in the histogram.

avg = round(np.average(rainfall),2) median = np.median(rainfall) print('The median annual rainfall is ' + '{0:.2f}'.format(median)) print('The average annual rainfall is ' + '{0:.2f}'.format(avg)) plt.hist(rainfall,bins=10) plt.xlabel("Rainfall (inches)") plt.ylabel("Count") plt.title("Austin Rainfaill Distribution") plt.savefig('austin-rainfall-histogram.png', dpi=300, bbox_inches='tight') plt.show()

```
The median annual rainfall is 33.59
The average annual rainfall is 33.62
```

If we create a scatter plot with a trend line, we can see that average rainfall (34 inches) has remained relatively steady over the past 120 years. The translucent light blue area surrounding the trend line gives us a 95% confidence level band.

ax = sns.regplot(x=df['Year'], y=df['Annual'], data=df); ax.set(xlabel='Year', ylabel='Rainfall (inches)') plt.title("Austin Rainfaill Scatter Plot") plt.savefig('austin-rainfall-scatterplot.png', dpi=300, bbox_inches='tight') plt.show()

Using a box plot we can see two years with heavy rainfall that fall outside of the IQR range.

ax = sns.boxplot(rainfall) ax.set(xlabel='Rainfall (inches)') plt.title("Austin Rainfaill Box Plot") plt.savefig('austin-rainfall-boxplot.png', dpi=300, bbox_inches='tight') plt.show()

**Bootstrap Sampling**

Now that we have performed some basic analysis, let’s dive a little deeper to see if our data fits a normal distribution.

Here we create a empirical cumulative distribution function, along with 10,000 bootstrap samples to get a probabilistic description of the data.

To calculate an ECDF, we order the annual rainfall totals from least to greatest and then determine the percentages of where the data resides. We know the median of Austin’s rainfall is 34 inches, so half of the values will reside below the 50% y axis, and the other half will reside above the 50% y axis.

We can determine a confidence interval of the mean by creating bootstrap replicates and repeatedly taking the mean of the bootstrap sample. The gray area in the ECDF are the bootstrap replicates, which allow us to get an idea of how the rainfall distribution is spread.

We can also determine the standard error of the mean from the original rainfall array and compare it to the standard deviation of the bootstrap samples. Here we see the standard error of the mean and the standard deviation are almost identical, telling us the data fits a normal distribution.

Lastly we create a histogram of the rainfall in inches which helps ensure we created our bootstrap samples correctly.

########################################################## #creates reusable functions def ecdf(data): #Compute ECDF for a one-dimensional array of measurements. # Number of data points: n n = len(data) # x-data for the ECDF: x x = np.sort(data) # y-data for the ECDF: y y = np.arange(1, n+1) / n return x, y ########################################################## for _ in range(50): # Generate bootstrap sample: bs_sample bs_sample = np.random.choice(rainfall, size=len(rainfall)) # Compute and plot ECDF from bootstrap sample x, y = ecdf(bs_sample) _ = plt.plot(x, y, marker='.', linestyle='none',color='gray', alpha=0.1) # Compute and plot ECDF from original data # Compute and plot ECDF from original data x, y = ecdf(rainfall) plt.xlabel('Rainfall (inches)') plt.ylabel('ECDF') plt.title('ECDF of Bootstrap Samples and Original Data') plt.plot(x, y, marker='.') plt.savefig('austin-rainfall-ecdf-bootstrap-samples.png', dpi=300, bbox_inches='tight') plt.show() ########################################################## def bootstrap_replicate_1d(data, func): """Generate bootstrap replicate of 1D data.""" bs_sample = np.random.choice(data, len(data)) result = func(bs_sample) #print(result) return result ########################################################## def draw_bs_reps(data, func, size=1): """Draw bootstrap replicates.""" # Initialize array of replicates: bs_replicates bs_replicates = np.empty(shape=size) # Generate replicates for i in range(size): bs_replicates[i] = bootstrap_replicate_1d(data,func) #print(bs_replicates) return bs_replicates ########################################################## # Take 10,000 bootstrap replicates of the mean: bs_replicates bs_replicates = draw_bs_reps(rainfall,np.mean,10_000) # Compute and print SEM sem = np.std(rainfall) / np.sqrt(len(rainfall)) print('The standard error of the mean for the annual rainfall array is ' + str(sem)) # Compute and print standard deviation of bootstrap replicates bs_std = np.std(bs_replicates) print('The standard deviation of the bootstrap replicates is ' + str(bs_std)) print('The 95% confidence interval is ' + str(np.percentile(bs_replicates,[2.5,97.5]))) plt.ylabel('PDF') plt.xlabel('Rainfall (inches)') plt.title('PDF of Bootstrap Replicates of the Mean Annual Rainfall') plt.hist(bs_replicates,bins=50,density=True) plt.savefig('austin-rainfall-pdf-boostrap-replicates.png', dpi=300, bbox_inches='tight') plt.show()

```
The standard error of the mean for the annual rainfall array is 0.8817466121486511
The standard deviation of the bootstrap replicates is 0.8693988526989405
The 95% confidence interval is [31.95956624 35.37121795]
```

You can take bootstrap samples of any summary statistic. Here we take bootstrap samples of the variance to see if it’s normally distributed. It appears close, but there is a slight tail to the right in the histogram.

# Take 10,000 bootstrap replicates of the variance: bs_replicates bs_replicates = draw_bs_reps(rainfall, np.var, size=10000) # Plot the probability density function of the bootstrap replicates plt.ylabel('PDF') plt.xlabel('Rainfall (inches)') plt.hist(bs_replicates, bins=50, density=True) plt.title('PDF of Bootstrap Replicates of the Variance of Annual Rainfall') plt.savefig('austin-rainfall-pdf-bootstrap-replicates-variance.png', dpi=300, bbox_inches='tight') plt.show()

**Permutation Samples**

Next we will determine if two variables have identical probability distributions by creating a function to generate permutation samples from two data sets.

In this example we will use the rainfall totals for May and August. The typically wettest and driest months of the year in Austin, TX.

First, let’s look at the histograms for May and August overlapped. From the histogram, it appears the two months have distinctly different distributions of rainfall.

plt.hist(df['May'], bins=20, color='tab:blue') plt.hist(df['Aug'], bins=20, color='tab:orange') plt.xlabel('Rainfall (inches)') plt.ylabel('Count') plt.title('Histogram Comparison of May and August Rainfall') plt.savefig('austin-rainfall-may-august-histogram.png', dpi=300, bbox_inches='tight') plt.show()

In the above histogram, there is quite a bit of overlap in the two samples. Another method to compare distributions is to create permutation samples of the May and August data and plot them as a CDF.

may = np.array(df['May']) aug = np.array(df['Aug']) def permutation_sample(data1, data2): """Generate a permutation sample from two data sets.""" # Concatenate the data sets: data data = np.concatenate((data1,data2)) # Permute the concatenated array: permuted_data permuted_data = np.random.permutation(data) # Split the permuted array into two: perm_sample_1, perm_sample_2 perm_sample_1 = permuted_data[:len(data1)] perm_sample_2 = permuted_data[len(data1):] return perm_sample_1, perm_sample_2 for _ in range(50): # Generate permutation samples perm_sample_1, perm_sample_2 = permutation_sample(may,aug) # Compute ECDFs x_1, y_1 = ecdf(perm_sample_1) x_2, y_2 = ecdf(perm_sample_2) # Plot ECDFs of permutation sample _ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red', alpha=0.02) _ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue', alpha=0.02) # Create and plot ECDFs from original data x_1, y_1 = ecdf(may) x_2, y_2 = ecdf(aug) _ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='tab:red') _ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='tab:blue') # Label axes, set margin, and show plot plt.margins(0.02) _ = plt.xlabel('Monthly rainfall (inches)') _ = plt.ylabel('ECDF') plt.title('Permutation Test for the Difference in ECDFs of May and August Rainfall') plt.savefig('austin-rainfall-permutation-test-difference-may-august.png', dpi=300, bbox_inches='tight') plt.show()

From the graph we can further conclude May and August are not identical. May, the month in red, and August never overlap. The permutation samples we created take on a purple hue and forms its own distinct band.

**Conclusion**

Bootstrap sampling is a powerful tool when performing exploratory data analysis and hypothesis testing.

We can use bootstrapping to calculate standard errors, construct confidence intervals, and perform hypothesis tests.

I encourage everyone to grab their local weather data and reuse the provided functions and create their own statistical analysis.