Central Limit Theorem - Advanced SQL Puzzles

To understand the Central Limit Theorem, we must first understand a few things such as random sampling, sampling error, sample sizes, and the sampling distribution of sample means. I cover them briefly here, but I link to their Wikipedia pages if you want a more robust explanation.

Random Sample

At the very heart of inferential statistics is the idea of a random sample. Random sampling involves selecting cases or units from a population in a way that gives each unit an equal chance of being chosen, and that does not affect the selection of other units. The process begins with identifying a sampling frame, or a physical representation of the population, and selecting cases in a way that allows for all possible combinations.

Getting a truly random sample is the hardest part of inferential statistics, as every possible combination of units must have a chance of being selected, however small. A random sample must be obtained by meeting specific selection criteria, such as ensuring each unit has an equal chance of being chosen and avoiding bias by using a representative sampling frame. It cannot be obtained haphazardly, such as by interviewing people who happen to pass by or using easily accessible research subjects. Without meeting these criteria, the sample cannot be considered truly random and may not provide accurate results.

Lastly, inferential statistics uses information from samples to make inferences about populations. Samples themselves are not of particular interest, as they are only useful insofar as they provide insight into the larger population. We don’t care about the sample itself, only what information it provides about the population.
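As a minimal sketch of the selection rule described in this section, the snippet below draws a simple random sample without replacement, so each unit in the frame is equally likely to be chosen. The sampling frame of 10,000 unit IDs, the function name simple_random_sample, and the seed are all hypothetical and only there for illustration.

```python
import numpy as np


def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement.

    Every unit has the same chance of selection, and no unit can be
    picked twice.
    """
    rng = np.random.default_rng(seed)
    return rng.choice(frame, size=n, replace=False)


# Hypothetical sampling frame: unit IDs 0 through 9,999
frame = np.arange(10_000)
sample = simple_random_sample(frame, n=100, seed=42)
print(sample[:10])  # first ten selected unit IDs
```

In practice the hard part is building a frame that really covers the population; the draw itself is the easy step.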

Sampling Error

Sampling error is the degree to which the results of a sample deviate from the true value of the population being studied. It is caused by the inherent randomness of the sampling process, which can result in a sample that is not fully representative of the population. Sampling error can occur in any study that relies on sampling, from public opinion polls to medical trials. The larger the sample size, the smaller the sampling error tends to be, as there is a higher probability that the sample will accurately represent the population.

Sampling error can be reduced through various techniques, including increasing the sample size, using an appropriate design such as stratified sampling, and ensuring the sampling method is truly random. However, it is essential to note that even with these measures, some sampling error will still exist. It is therefore important for researchers and analysts to take sampling error into account when interpreting results and drawing conclusions from their data.
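To make the link between sample size and sampling error concrete, here is a small simulation sketch. The population (50,000 values from an exponential distribution), the sample sizes, and the helper name mean_abs_sampling_error are assumptions chosen for illustration; any population would show the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 50,000 values from an exponential distribution
population = rng.exponential(scale=10.0, size=50_000)
true_mean = population.mean()


def mean_abs_sampling_error(n, repeats=200):
    """Average absolute gap between a sample mean and the population mean."""
    errors = []
    for _ in range(repeats):
        sample = rng.choice(population, size=n, replace=False)
        errors.append(abs(sample.mean() - true_mean))
    return float(np.mean(errors))


# The typical error shrinks as the sample size grows
errors_by_size = {n: mean_abs_sampling_error(n) for n in (10, 100, 1000)}
for n, err in errors_by_size.items():
    print('n = {:>4}: average absolute error = {:.3f}'.format(n, err))
```

The error never reaches zero; it only gets smaller on average as n increases.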

Sample Size

Sample size is a critical aspect of statistical analysis, as it determines the accuracy and reliability of the results obtained from a study or experiment. A sample is a subset of a population, often used in research to draw conclusions about the entire population. A sample size is the number of individuals or units that are included in the sample, and it is an essential parameter that needs to be considered when designing a study.

In general, larger sample sizes provide more accurate and reliable results than smaller sample sizes. This is because larger samples reduce the effect of sampling error and increase the statistical power of the study. Larger sample sizes reduce the impact of random variability, making the estimates more precise. They do not, however, correct for bias in how the sample was selected.

However, larger sample sizes may not always be feasible, as they can be expensive, time-consuming, and impractical. In some cases, smaller samples may be sufficient to draw valid conclusions, especially if the population is homogeneous or the effect size is large. In addition, larger samples do not necessarily guarantee better results if the study design is flawed or the sampling method is biased.

Sampling Distribution

In statistics, a sampling distribution is the theoretical distribution of a sample statistic that would be obtained from an infinite number of samples of the same size drawn from a population. The concept of a sampling distribution is important in inferential statistics, as it allows us to make statistical inferences about a population based on a sample of data.

Imagine taking 50 observations, computing the mean, and then repeating this process 1,000 times. If you were to plot those sample means, you would have constructed a sampling distribution of sample means.

Each sample mean is likely to differ from the true population mean to some extent due to random sampling error. The sampling distribution of the sample mean tells us how much variability we can expect in the sample mean due to random sampling error. It allows us to make statistical inferences about the true population mean with a certain level of confidence.
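The thought experiment above (50 observations, repeated 1,000 times) can be sketched in a few lines. The uniform population on [0, 10) is an assumption made only so that the true mean (5) is known in advance.

```python
import numpy as np

rng = np.random.default_rng(1)

n_repeats = 1000  # how many samples we draw
n_obs = 50        # observations in each sample

# Assumed population: uniform on [0, 10), so the population mean is 5
draws = rng.uniform(0, 10, size=(n_repeats, n_obs))

# One mean per sample; together they approximate the sampling distribution
sample_means = draws.mean(axis=1)

print('Mean of the sample means: {:.3f}'.format(sample_means.mean()))
print('Spread of the sample means: {:.3f}'.format(sample_means.std(ddof=1)))
```

Plotting sample_means as a histogram gives the picture described in the text: individual means scatter around the population mean.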

Central Limit Theorem

Now that we have covered the above concepts, we can understand the Central Limit Theorem.

The central limit theorem states that if you take independent random samples from any distribution with a finite mean and variance, then the distribution of the sample means will tend towards a normal distribution as the sample size increases. This is true even if the original distribution is not normal.

The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applied to many problems involving other types of distributions.
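For readers who prefer notation, one common way to write the classical statement is sketched below. The symbols are my own choice and not from the text above: X_1, ..., X_n are independent, identically distributed observations with mean \mu and finite variance \sigma^2, and \bar{X}_n is their sample mean.

```latex
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\xrightarrow{d}\; \mathcal{N}(0, 1)
\quad \text{as } n \to \infty
```

Read informally, for large n the sample mean behaves approximately like a normal variable centered at \mu with standard deviation \sigma / \sqrt{n}.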

Now, we can write some Python code to show the workings of the central limit theorem.

First, we create a probability graph showing a Poisson distribution with a lambda of 5.

In probability theory, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event.

The parameter lambda (λ) in the Poisson distribution represents the average rate of occurrence of the event of interest in the given interval of time or space. It is also sometimes referred to as the rate parameter. The distribution of the Poisson random variable is completely determined by its value of λ.

For example, if λ is equal to 5, it means that, on average, five events are expected to occur in the given interval of time or space.
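As a reference, the Poisson probability mass function and its moments can be written as follows, where k is the number of events and \lambda is the rate parameter described above.

```latex
P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
\qquad
\mathrm{E}[X] = \lambda, \quad \mathrm{Var}(X) = \lambda, \quad \mathrm{SD}(X) = \sqrt{\lambda}
```

With \lambda = 5, the standard deviation of a single observation is \sqrt{5} \approx 2.24.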

From this graph we can determine the distribution does not fit a normal distribution.

However, once we take sample after sample and compute the mean of each, we can see that the sampling distribution of sample means is approximately normal, with a mean of about 5 and a standard deviation of about 0.22. That is the Poisson standard deviation, sqrt(5) ≈ 2.24, divided by sqrt(100) = 10 for samples of 100 observations.

Standard Error of the Mean

The standard error of the mean (also called the standard deviation of the sampling distribution and often abbreviated as SE or SEM) is the standard deviation of the sampling distribution of sample means. Remember that the sampling distribution of sample means is obtained by taking multiple samples from a population and computing the mean for each sample; this process is repeated many times to obtain a distribution of sample means.

The concept of standard error is not limited to sample means; it can be applied to any sample statistic used to estimate a population parameter, including the mean, variance, range, and so on. The standard error measures the variability of the estimate of the population parameter, and it reflects the amount of sampling error due to random sampling variation. A smaller standard error indicates that the sample statistic is more precise and more likely to be closer to the true population parameter.

The relationship between the central limit theorem and standard error is important because it helps us understand the accuracy of our sample means. A smaller standard error indicates that the sample mean is likely closer to the true population mean. Conversely, a larger standard error indicates that the sample mean is more likely to be farther from the true population mean.

When we know the population standard deviation, we can determine the standard error of the mean by dividing it by the square root of the number of observations in each sample. When we only know the sample standard deviation, we divide the sample standard deviation by the square root of the number of observations to obtain an estimate of the standard error.

The standard error is a measure of the variability of the means of all possible samples of a particular size that could be taken from a population, and it is computed based on the number of observations used in each sample, not the number of samples taken. The more observations included in each sample, the better the chance each sample will represent the population and the closer the sample means will be to the population mean. Therefore, the number of observations per sample is an important factor in determining the standard error; together with the standard deviation, it is all that is needed to compute it.
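A minimal sketch of the estimated standard error of the mean, computed as the sample standard deviation divided by the square root of the number of observations. The helper name standard_error and the simulated Poisson sample are assumptions for illustration; scipy.stats.sem is used only as a cross-check.

```python
import numpy as np
from scipy import stats


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    # ddof=1 gives the sample standard deviation (divides by n - 1)
    return sample.std(ddof=1) / np.sqrt(sample.size)


rng = np.random.default_rng(2)
sample = rng.poisson(lam=5, size=100)  # one sample of 100 observations

se_manual = standard_error(sample)
se_scipy = stats.sem(sample)  # SciPy's built-in, also uses ddof=1 by default

print('Manual SE: {:.4f}'.format(se_manual))
print('SciPy SE:  {:.4f}'.format(se_scipy))
```

Note that only one sample is needed here; the number of repeated samples plays no role in the formula.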

If the standard error is high, then the sample mean is a less precise estimate of the population mean, meaning that means from repeated samples would vary widely. On the other hand, if the standard error is low, then the sample mean is a more precise estimate of the population mean, meaning that means from repeated samples would cluster tightly around it.

The standard error is used in many statistical equations, including:

Confidence intervals: To calculate the range of values within which the true population mean or proportion is likely to fall.
Hypothesis testing: To calculate the test statistic and determine the probability of obtaining a sample mean or proportion as extreme as the one observed, assuming the null hypothesis is true.
Regression analysis: To calculate the standard error of the regression, which is used to estimate the precision of the estimated regression coefficients.
Meta-analysis: To calculate the standard error of the weighted mean effect size, which is used to estimate the precision of the estimated overall effect size across studies.

In general, the standard error is an important parameter in statistical analysis as it provides information about the precision of the estimates obtained from a sample and allows us to make inferences about the population parameters.
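To illustrate the first use in the list above, here is a sketch of a normal-approximation confidence interval for a mean, built as the sample mean plus or minus a critical value times the standard error. The function name mean_confidence_interval and the simulated sample are hypothetical; for small samples a t-based interval would usually be preferred.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    sample = np.asarray(sample, dtype=float)
    mean = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(sample.size)
    # Two-sided critical value, about 1.96 for 95% confidence
    z = stats.norm.ppf(0.5 + confidence / 2)
    return mean - z * se, mean + z * se


rng = np.random.default_rng(3)
sample = rng.poisson(lam=5, size=100)

low, high = mean_confidence_interval(sample)
print('95% confidence interval: ({:.2f}, {:.2f})'.format(low, high))
```

A smaller standard error gives a narrower interval, which is the sense in which it measures precision.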

When we calculate the standard error for our example dataset, we get the following. In general, smaller standard errors suggest that the sample is more representative of the population and that the results are more reliable.

The standard error of the mean is approximately 0.22 (sqrt(5) / sqrt(100) ≈ 2.24 / 10).

Here is an example of a textbook-style question you would see in an introductory statistics class.

A population has a mean of 24.12 and a standard deviation of 4. Assume that a sampling distribution of sample means has been constructed based on repeated samples of n = 100 from this population.

a. What would be the value of the mean of the sampling distribution?
b. What would be the value of the standard error of the mean?

To answer these questions, the value of the mean of the sampling distribution would equal 24.12, the mean of the population.

The value of the standard error would be calculated as 4 / sqrt(100) = 4 / 10, which equals 0.4.
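The arithmetic for this textbook question can be checked with a few lines of Python. The variable names are my own; the numbers come straight from the question.

```python
import math

population_mean = 24.12
population_sd = 4.0
n = 100  # observations in each sample

# a. The mean of the sampling distribution equals the population mean
mean_of_sampling_distribution = population_mean

# b. The standard error is the population SD divided by sqrt(n)
standard_error = population_sd / math.sqrt(n)

print(mean_of_sampling_distribution)  # 24.12
print(standard_error)                 # 0.4
```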

Here is the Python code for the Poisson demonstration above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# Define the number of samples and the sample size
n_samples = 1000
sample_size = 100

# Define the mean of the Poisson distribution
lam = 5

# Generate samples from a Poisson distribution
# Each row is one sample of sample_size observations
samples = np.random.poisson(lam=lam, size=(n_samples, sample_size))
# print(samples[:10])  # print the first 10 samples
print('The mean of all simulated observations is {}.'.format(samples.mean()))
print('The standard deviation of all simulated observations is {}.'.format(samples.std()))

    
#####################################################################

# Create a Poisson distribution object
dist = poisson(mu=lam)

# Support values 0 through 19 at which to evaluate the PMF
x = range(20)

# Compute the PMF at those values
pmf = dist.pmf(x)
print('The Poisson mean is {}.'.format(dist.mean()))
print('The Poisson standard deviation is {}.'.format(dist.std()))


# Plot the PMF
plt.bar(x, pmf)
plt.xlabel('Occurrences')
plt.ylabel('Probability')
plt.title('Poisson Distribution (lambda = {})'.format(lam))
plt.savefig('central-limit-theorem-sample-poisson-histogram.png', dpi=300, bbox_inches='tight')
plt.show()

#####################################################################

# Compute the means of each sample
sample_means = np.mean(samples, axis=1)

# Plot the histogram of the sample means
plt.hist(sample_means, bins=50, density=True)

# Compute the theoretical normal distribution with the same mean and variance
mu = np.mean(sample_means)
sigma = np.std(sample_means)
x = np.linspace(mu - 5*sigma, mu + 5*sigma, 100)
y = 1/(sigma*np.sqrt(2*np.pi)) * np.exp(-(x - mu)**2/(2*sigma**2))

# Label the chart and overlay the fitted normal curve
# (the histogram uses density=True, so the y-axis is a density)
plt.title('Sampling Distribution of Sample Means')
plt.xlabel('Mean')
plt.ylabel('Density')
plt.plot(x, y)
plt.savefig('central-limit-theorem-sample-distribution-sample-means-histogram.png', dpi=300, bbox_inches='tight')
plt.show()

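# Calculate the standard error of the mean
# sigma is the standard deviation of the sample means, so it is already an
# empirical estimate of the standard error; dividing it by sqrt(sample_size)
# again would understate it by a factor of 10.
sem_empirical = sigma
# Theoretical value: population SD (sqrt(lam) for a Poisson) over sqrt(n)
sem_theoretical = np.sqrt(lam) / np.sqrt(sample_size)
print('The empirical standard error of the mean is {:.2f}.'.format(sem_empirical))
print('The theoretical standard error of the mean is {:.2f}.'.format(sem_theoretical))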