Rainfall in Austin, Texas

The following is an analysis of the average rainfall in my home city of Austin, TX.

First, let’s import our modules and data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('austin_texas_rainfall.csv',sep='|')
rainfall = np.array(df['Annual'])

Let’s look at Austin’s rainfall year over year.

Looking at the graph, there doesn’t appear to be a trend from one year to the next.

plt.figure(figsize=(16, 4))
plt.bar(df['Year'],df['Annual'])
plt.xlabel("Year")
plt.ylabel("Rainfall (inches)")
plt.show()

Now let’s calculate a few summary statistics and look at a histogram to see if the average rainfall follows a normal distribution.

Looking at the graph it’s hard to tell; with only 120 years worth of data points there may be binning bias in the histogram.

avg = round(np.average(rainfall),2)
median = np.median(rainfall)

print('The median annual rainfall is ' + '{0:.2f}'.format(median))
print('The average annual rainfall is ' + '{0:.2f}'.format(avg))

plt.hist(rainfall,bins=10)
plt.xlabel("Rainfall (inches)")
plt.ylabel("Count")
plt.show()

If we create a scatter plot with a trend line, we can see that average rainfall (34 inches) has remained relatively steady over the past 120 years. The translucent light blue area surrounding the trend line gives us a 95% confidence level band.

ax = sns.regplot(x=df['Year'], y=df['Annual'], data=df);
ax.set(xlabel='Year', ylabel='Rainfall (inches)')
plt.show()

Using a box plot we can see two years with heavy rainfall that fall outside of the IQR range.

ax = sns.boxplot(rainfall)
ax.set(xlabel='Rainfall (inches)')
plt.show()

Bootstrap Sampling

Now that we have performed some basic analysis, let’s dive a little deeper to see if our data fits a normal distribution.

Here we create a empirical cumulative distribution function, along with 10,000 bootstrap samples to get a probabilistic description of the data.

To calculate an ECDF, we order the annual rainfall totals from least to greatest and then determine the percentages of where the data resides. We know the median of Austin’s rainfall is 34 inches, so half of the values will reside below the 50% y axis, and the other half will reside above the 50% y axis.

We can determine a confidence interval of the mean by creating bootstrap replicates and repeatedly taking the mean of the bootstrap sample. The gray area in the ECDF are the bootstrap replicates, which allow us to get an idea of how the rainfall distribution is spread.

We can also determine the standard error of the mean from the original rainfall array and compare it to the standard deviation of the bootstrap samples. Here we see the standard error of the mean and the standard deviation are almost identical, telling us the data fits a normal distribution.

Lastly we create a histogram of the rainfall in inches which helps ensure we created our bootstrap samples correctly.

##########################################################
#creates resusable functions
def ecdf(data):
    #Compute ECDF for a one-dimensional array of measurements.
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y

##########################################################

for _ in range(50):
# Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)


    _ = plt.plot(x, y, marker='.', linestyle='none',color='gray', alpha=0.1)
    # Compute and plot ECDF from original data


# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
plt.xlabel('Rainfall (inches)')
plt.ylabel('ECDF')
plt.plot(x, y, marker='.')
plt.show()

##########################################################

def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    bs_sample = np.random.choice(data, len(data))
    result = func(bs_sample)
    #print(result)
    return result


##########################################################

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""
    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(shape=size)
    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data,func)
    
    #print(bs_replicates)
    return bs_replicates
    
##########################################################

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall,np.mean,10_000)
# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print('The standard error of the mean for the annual rainfall array is ' + str(sem))
# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print('The standard deviation of the bootstrap replicates is ' + str(bs_std))

print('The 95% confidence interval is ' + str(np.percentile(bs_replicates,[2.5,97.5])))

plt.ylabel('PDF')
plt.xlabel('Rainfall (inches)')
plt.hist(bs_replicates,bins=50,density=True)
plt.show()

You can take bootstrap samples of any summary statistic. Here we take bootstrap samples of the variance to see if it’s normally distributed. It appears close, but there is a slight tail to the right in the histogram.

Permutation Samples

Next we will determine if two variables have identical probability distributions by creating a function to generate permutation samples from two data sets.

In this example we will use the rainfall totals for May and August. The typically wettest and driest months of the year in Austin, TX.

First, let’s look at the histograms for May and August overlapped. From the histogram, it appears the two months have distinctly different distributions of rainfall.

plt.hist(df['May'],bins=20,color='tab:blue')
plt.hist(df['Aug'],bins=20,color='tab:orange')
plt.xlabel('Rainfall (inches)')
plt.ylabel('Count')
plt.show()

In the above histogram, there is quite a bit of overlap in the two samples. Another method to compare distributions is to create permutation samples of the May and August data and plot them as a CDF.

may = np.array(df['May'])
aug = np.array(df['Aug'])

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Concatenate the data sets: data
    data = np.concatenate((data1,data2))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2

for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(may,aug)
    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)
    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
    color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
    color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(may)
x_2, y_2 = ecdf(aug)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='tab:red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='tab:blue')
# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('Monthly rainfall (inches)')
_ = plt.ylabel('ECDF')

plt.show()

From the graph we can further conclude May and August are not identical. May, the month in red, and August never overlap. The permutation samples we created take on a purple hue and forms its own distinct band.

Conclusion

Bootstrap sampling is a powerful tool when performing exploratory data analysis and hypothesis testing.

We can use bootstrapping to calculate standard errors, construct confidence intervals, and perform hypothesis tests.

I encourage everyone to grab their local weather data and reuse the provided functions and create their own statistical analysis.