Rainfall in Austin, Texas - Advanced SQL Puzzles

The following is an analysis of the average rainfall in my home city of Austin, TX.

First, let’s import our modules and data.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('austin_texas_rainfall.csv',sep='|')
rainfall = np.array(df['Annual'])

Let’s look at Austin’s rainfall year over year.

plt.figure(figsize=(16, 4))
plt.bar(df['Year'],df['Annual'])
plt.xlabel("Year")
plt.ylabel("Rainfall (inches)")
plt.title("Austin Rainfall by Year")
plt.savefig('austin-rainfall-barchart.png', dpi=300, bbox_inches='tight')
plt.show()

Looking at the graph, there doesn’t appear to be a trend from one year to the next.

Now, let’s calculate a few summary statistics and look at a histogram to see if the average rainfall follows a normal distribution.

avg = round(np.average(rainfall),2)
median = np.median(rainfall)

print('The median annual rainfall is ' + '{0:.2f}'.format(median))
print('The average annual rainfall is ' + '{0:.2f}'.format(avg))

plt.hist(rainfall,bins=10)
plt.xlabel("Rainfall (inches)")
plt.ylabel("Count")
plt.title("Austin Rainfaill Distribution")
plt.savefig('austin-rainfall-histogram.png', dpi=300, bbox_inches='tight')
plt.show()

The median annual rainfall is 33.59
The average annual rainfall is 33.62

Looking at the graph it’s hard to tell; with only 120 years worth of data points, there may be binning bias in the histogram.

If we create a scatter plot with a trend line, we can see that average rainfall (34 inches) has remained relatively steady over the past 120 years. The translucent light blue area surrounding the trend line gives us a 95% confidence level band.

ax = sns.regplot(x=df['Year'], y=df['Annual'], data=df);
ax.set(xlabel='Year', ylabel='Rainfall (inches)')
plt.title("Austin Rainfaill Scatter Plot")
plt.savefig('austin-rainfall-scatterplot.png', dpi=300, bbox_inches='tight')
plt.show()

Using a box plot, we can see two years with heavy rainfall outside the IQR range.

ax = sns.boxplot(rainfall)
ax.set(xlabel='Rainfall (inches)')
plt.title("Austin Rainfaill Box Plot")
plt.savefig('austin-rainfall-boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

Bootstrap Sampling

Now that we have performed some basic analysis let’s dive a little deeper to see if our data fits a normal distribution.

Here, we create an empirical cumulative distribution function, along with 10,000 bootstrap samples, to get a probabilistic data description.

To calculate an ECDF, we order the annual rainfall totals from least to greatest and then determine the percentages of where the data resides. We know the median of Austin’s rainfall is 34 inches, so half of the values will reside below the 50% y-axis, and the other half will reside above the 50% y-axis.

We can determine a confidence interval of the mean by creating bootstrap replicates and repeatedly taking the mean of the bootstrap sample. The gray area in the ECDF is the bootstrap replicates, which allow us to understand how the rainfall distribution is spread.

We can also determine the standard error of the mean from the original rainfall array and compare it to the standard deviation of the bootstrap samples. Here, we see the standard error of the mean and the standard deviation are almost identical, telling us the data fits a normal distribution.

Lastly, we create a histogram of the rainfall in inches, which helps ensure we created our bootstrap samples correctly.

##########################################################
#creates reusable functions
def ecdf(data):
    #Compute ECDF for a one-dimensional array of measurements.
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y

##########################################################

for _ in range(50):
# Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))

    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)


    _ = plt.plot(x, y, marker='.', linestyle='none',color='gray', alpha=0.1)
    # Compute and plot ECDF from original data


# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
plt.xlabel('Rainfall (inches)')
plt.ylabel('ECDF')
plt.title('ECDF of Bootstrap Samples and Original Data')
plt.plot(x, y, marker='.')
plt.savefig('austin-rainfall-ecdf-bootstrap-samples.png', dpi=300, bbox_inches='tight')
plt.show()

##########################################################

def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    bs_sample = np.random.choice(data, len(data))
    result = func(bs_sample)
    #print(result)
    return result


##########################################################

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""
    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(shape=size)
    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data,func)
    
    #print(bs_replicates)
    return bs_replicates
    
##########################################################

# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall,np.mean,10_000)

# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print('The standard error of the mean for the annual rainfall array is ' + str(sem))

# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)

print('The standard deviation of the bootstrap replicates is ' + str(bs_std))
print('The 95% confidence interval is ' + str(np.percentile(bs_replicates,[2.5,97.5])))

plt.ylabel('PDF')
plt.xlabel('Rainfall (inches)')
plt.title('PDF of Bootstrap Replicates of the Mean Annual Rainfall')
plt.hist(bs_replicates,bins=50,density=True)
plt.savefig('austin-rainfall-pdf-boostrap-replicates.png', dpi=300, bbox_inches='tight')
plt.show()

The standard error of the mean for the annual rainfall array is 0.8817466121486511
The standard deviation of the bootstrap replicates is 0.8693988526989405
The 95% confidence interval is [31.95956624 35.37121795]

You can take bootstrap samples of any summary statistic. Here, we take bootstrap samples of the variance to see if it’s normally distributed. It appears close, but there is a slight tail to the right in the histogram.

# Take 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.var, size=10000)

# Plot the probability density function of the bootstrap replicates
plt.ylabel('PDF')
plt.xlabel('Rainfall (inches)')
plt.hist(bs_replicates, bins=50, density=True)
plt.title('PDF of Bootstrap Replicates of the Variance of Annual Rainfall')
plt.savefig('austin-rainfall-pdf-bootstrap-replicates-variance.png', dpi=300, bbox_inches='tight')
plt.show()

Permutation Samples

Next, we will determine if two variables have identical probability distributions by creating a function to generate permutation samples from two data sets.

In this example, we will use the rainfall totals for May and August—the typically wettest and driest months of the year in Austin, TX.

First, let’s look at the histograms for May and August overlapped. From the histogram, it appears the two months have distinctly different rainfall distributions.

plt.hist(df['May'], bins=20, color='tab:blue')
plt.hist(df['Aug'], bins=20, color='tab:orange')
plt.xlabel('Rainfall (inches)')
plt.ylabel('Count')
plt.title('Histogram Comparison of May and August Rainfall')
plt.savefig('austin-rainfall-may-august-histogram.png', dpi=300, bbox_inches='tight')
plt.show()

In the above histogram, there is quite a bit of overlap between the two samples. Another method to compare distributions is to create permutation samples of the May and August data and plot them as a CDF.

may = np.array(df['May'])
aug = np.array(df['Aug'])

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Concatenate the data sets: data
    data = np.concatenate((data1,data2))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2

for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(may,aug)
    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)
    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
    color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
    color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(may)
x_2, y_2 = ecdf(aug)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='tab:red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='tab:blue')
# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('Monthly rainfall (inches)')
_ = plt.ylabel('ECDF')
plt.title('Permutation Test for the Difference in ECDFs of May and August Rainfall')
plt.savefig('austin-rainfall-permutation-test-difference-may-august.png', dpi=300, bbox_inches='tight')
plt.show()

From the graph, we can further conclude that May and August are not identical. May, the month in red, and August never overlap. The permutation samples we created take on a purple hue and form its own distinct band.

Conclusion

Bootstrap sampling is a powerful tool for exploratory data analysis and hypothesis testing.

We can use bootstrapping to calculate standard errors, construct confidence intervals, and perform hypothesis tests.

I encourage everyone to grab their local weather data and create their own statistical analysis.