Standard Normal Curve - Advanced SQL Puzzles

The normal curve and the standard normal curve are both probability distributions that describe how data is distributed around a mean value. The main difference is that a normal curve can have any mean and any positive standard deviation, while the standard normal curve always has a mean of 0 and a standard deviation of 1. Both are theoretical curves: they describe the shape that a histogram of normally distributed data approaches as the number of samples grows without bound.

The normal curve is characterized by two parameters: its mean (μ) and its standard deviation (σ). It is a bell-shaped curve that is symmetric around the mean, with the probability density decreasing as the distance from the mean increases.

The standard normal curve, on the other hand, is a specific normal curve where the mean is 0 and the standard deviation is 1. It is often used as a reference distribution in statistical analysis, since any normal distribution can be transformed into the standard normal distribution by subtracting the mean from each value and dividing the result by the standard deviation: z = (x - μ) / σ. It should also be noted that the standard normal curve is a theoretical curve because it represents an idealized mathematical model of a probability distribution.
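
As a small illustration of this transformation, the sketch below uses `NormalDist` from Python's standard `statistics` module. The exam-score mean of 70 and standard deviation of 10 are hypothetical values chosen only for the example; they do not come from any dataset in this article.

```python
from statistics import NormalDist

# Hypothetical normal curve: exam scores with mean 70 and standard deviation 10
scores = NormalDist(mu=70, sigma=10)
standard = NormalDist(mu=0, sigma=1)

x = 85
# Subtract the mean, then divide by the standard deviation
z = (x - scores.mean) / scores.stdev

# The area at or below x on the original curve matches
# the area at or below z on the standard normal curve
p_original = scores.cdf(x)
p_standard = standard.cdf(z)

print(f"z = {z}")
print(f"P(X <= {x}) = {p_original:.4f}")
print(f"P(Z <= {z}) = {p_standard:.4f}")
```

Both probabilities agree, which is why a single table for the standard normal curve can serve every normal curve.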

Relating standard deviations to the area under the normal curve is a fundamental concept in statistical inference, often referred to as the 1-2-3 Rule (also known as the empirical rule or the 68-95-99.7 rule).

One standard deviation above and below the mean covers around 68% of the area under the curve.
Two standard deviations above and below the mean cover approximately 95% of the area.
Three standard deviations above and below the mean cover about 99.7% of the area.
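
These three areas can be checked directly. The following is a minimal sketch that uses the standard library's `NormalDist` (rather than a printed table) to compute the area between -k and +k standard deviations; the variable names are our own.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Area under the curve between -k and +k standard deviations
areas = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within {k} standard deviation(s): {area:.4f}")
```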

The Table of Areas Under the Normal Curve is a reference table used to calculate probabilities associated with a standard normal distribution. For larger sample sizes, we use the z-score values associated with a particular two-sided level of confidence, such as 95% or 99%, which are approximately 1.96 and 2.58, respectively.
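
Instead of looking these values up in a table, we can compute them. The sketch below assumes a two-sided confidence level, so the leftover area is split evenly between the two tails; `z_critical` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def z_critical(confidence):
    """Return z such that the central `confidence` area lies within -z and +z."""
    tail = (1 - confidence) / 2
    return std_normal.inv_cdf(1 - tail)


z_95 = z_critical(0.95)
z_99 = z_critical(0.99)

print(f"95% confidence: z = {z_95:.2f}")
print(f"99% confidence: z = {z_99:.2f}")
```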

However, for smaller sample sizes, we use the t-distribution table, and we must account for degrees of freedom. Degrees of freedom refers to the number of independent pieces of information available in a sample to estimate a population parameter; that is, the number of observations that are free to vary after certain restrictions have been imposed. The smaller the sample size, the wider and flatter the t-distribution curve will be.
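
To see how the degrees of freedom affect the curve, the sketch below compares two-sided 95% critical values from the t-distribution with the z value, using `scipy.stats`. The particular degrees of freedom listed are arbitrary choices for illustration.

```python
from scipy.stats import norm, t

confidence = 0.95
upper = 1 - (1 - confidence) / 2  # 0.975 for a two-sided 95% interval

z_value = norm.ppf(upper)

# Critical t values for a few (arbitrarily chosen) degrees of freedom
t_values = {df: t.ppf(upper, df) for df in (2, 5, 10, 30, 100)}

print(f"z        : {z_value:.3f}")
for df, value in t_values.items():
    print(f"t, df={df:<3d}: {value:.3f}")
```

The critical value shrinks toward the z value as the degrees of freedom grow, which reflects the t-distribution's heavier tails at small sample sizes.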

While both curves describe the probability distribution of data, the standard normal curve has a fixed mean and standard deviation, while the normal curve can have any mean and standard deviation. The curve's inflection points (the points at which the curve switches from bending downward to bending upward) are one standard deviation above and below the mean.
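
The location of the inflection points can be checked numerically. The sketch below is one possible approach, not the only one: it approximates the second derivative of the standard normal density with finite differences and reports where its sign changes. The grid spacing of 0.01 is an arbitrary choice.

```python
import numpy as np

# Standard normal density evaluated on a fine grid
x = np.linspace(-3.0, 3.0, 601)
h = x[1] - x[0]
pdf = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# Central finite-difference approximation of the second derivative
second_diff = (pdf[2:] - 2 * pdf[1:-1] + pdf[:-2]) / h**2
interior = x[1:-1]

# An inflection point lies between neighbors where the sign flips
sign = np.sign(second_diff)
idx = np.where(sign[:-1] * sign[1:] < 0)[0]
inflection_points = (interior[idx] + interior[idx + 1]) / 2

print(inflection_points)
```

The two reported locations sit within one grid step of -1 and +1, that is, one standard deviation on either side of the mean.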

To begin with, we will generate 100,000 data points that follow a normal distribution and create a histogram to visualize the data (the blue data bars in the histogram below). Then, we will use the norm object from the scipy.stats module to create a normal (Gaussian) continuous random variable. It takes the mean and standard deviation as input parameters to define the distribution of the random variable (the orange line outlining the histogram below).

Using the resulting norm_dist object, we can generate probability density functions, cumulative distribution functions, and random numbers that follow the normal distribution with the given mean and standard deviation.
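
As a brief sketch of those three capabilities, the example below builds a distribution object with a mean of 0 and a standard deviation of 1 and calls its `pdf`, `cdf`, and `rvs` methods. The evaluation points and the seed are arbitrary values picked for illustration.

```python
from scipy.stats import norm

# Normal distribution object with mean 0 and standard deviation 1
norm_dist = norm(loc=0, scale=1)

# Probability density function: height of the curve at the mean
density_at_mean = norm_dist.pdf(0)

# Cumulative distribution function: area to the left of 1.96
area_below = norm_dist.cdf(1.96)

# Random numbers that follow the distribution (seeded for repeatability)
samples = norm_dist.rvs(size=5, random_state=42)

print(f"pdf(0)    = {density_at_mean:.4f}")
print(f"cdf(1.96) = {area_below:.4f}")
print("samples   =", samples)
```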

To create an array of evenly spaced values within a given interval, we will use the NumPy function np.arange. We will use an interval of 3 standard deviation units on either side of the mean, as we know from the 1-2-3 rule that about 99.7% of all values fall within 3 standard deviation units of the mean.
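
One detail worth knowing is that np.arange treats the stop value as exclusive, so the grid usually ends just short of the upper bound. The sketch below compares it with np.linspace, which includes both endpoints; the mean of 0 and standard deviation of 1 are assumed values for illustration.

```python
import numpy as np

mean, std, step = 0.0, 1.0, 0.1  # assumed values for illustration

# np.arange: fixed step size, stop value is not included
x_arange = np.arange(mean - 3 * std, mean + 3 * std, step)

# np.linspace: fixed number of points, both endpoints included
num_points = int(round(6 * std / step)) + 1
x_linspace = np.linspace(mean - 3 * std, mean + 3 * std, num_points)

print(len(x_arange), x_arange[-1])
print(len(x_linspace), x_linspace[-1])
```

For plotting a smooth curve either function works; np.linspace is the safer choice when the curve must reach exactly 3 standard deviations.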

The resulting code will produce the following chart illustrating the standard normal curve.

The standard deviation for x is 0.9963232329077091
The mean for x is -0.004739085409034449

The code to create this chart is below.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate 100,000 random values of x from the standard normal distribution
x = np.random.standard_normal(100_000)

# Plot the histogram of x
plt.hist(x, bins=50, density=True, alpha=0.5)

# Calculate the mean and standard deviation of x
mean_x = np.mean(x)
std_x = np.std(x)

# Create a normal distribution object with the mean and standard deviation values
norm_dist = norm(mean_x, std_x)

# Generate values for x from mean-3*std to mean+3*std with a step of 0.1
x_values = np.arange(mean_x-3*std_x, mean_x+3*std_x, 0.1)

# Calculate the probability density function of the normal distribution at different values of x
y_values = norm_dist.pdf(x_values)

# Plot the normal curve
plt.plot(x_values, y_values)

# Set the x-axis and y-axis labels
plt.xlabel('X')
plt.ylabel('Probability Density')

# Set the title of the plot
plt.title('Standard Normal Curve with Mean={:.2f} and StdDev={:.2f}'.format(mean_x, std_x))

# Show the plot
plt.show()

# Print the standard deviation and mean of x
print("The standard deviation for x is", np.std(x))
print("The mean for x is", np.mean(x))

Next, we will create an ECDF (Empirical Cumulative Distribution Function) graph.

ECDFs provide a way to visually represent the distribution of a given set of data. They are useful for comparing different data sets, as well as for comparing a data set to a theoretical distribution.

Unlike histograms or other types of graphs, ECDF graphs show each individual observation, eliminating the potential for binning bias. While ECDF graphs may not be as widely used as other graphs, they offer several benefits and are a valuable tool for data visualization.

To read an ECDF graph, you need to understand that it plots the data values on the x-axis and, on the y-axis, the proportion of observations that are less than or equal to each value.

Starting from the left side of the graph, the ECDF starts at 0 and increases toward 1 as more data points are included. For example, if 25% of the data points have a value less than or equal to 10, the ECDF curve will be at 0.25 when the x-axis is at 10.
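
The reading described above can be reproduced with a few lines of NumPy. In this minimal sketch the eight data values are made up so that exactly two of them (25%) fall at or below 10; `ecdf_at` is a helper name introduced here.

```python
import numpy as np

# Hypothetical data: 2 of the 8 values are less than or equal to 10
data = np.array([3, 7, 12, 15, 18, 21, 25, 30])
data_sorted = np.sort(data)


def ecdf_at(value):
    """Proportion of observations less than or equal to `value`."""
    count = np.searchsorted(data_sorted, value, side="right")
    return count / len(data_sorted)


print(ecdf_at(10))  # 0.25
print(ecdf_at(0))   # 0.0
print(ecdf_at(30))  # 1.0
```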

The curve’s steepness indicates the density of the data points around a particular value. A steep curve indicates that the data points are tightly clustered around that value, while a shallow curve indicates that the data points are more spread out.

The ECDF graph provides a comprehensive view of the data distribution and can be used to identify trends, patterns, and outliers in the data.

The ECDF plot for the standard normal curve is represented below. Half of the observations fall below 0, and the other half fall above 0.

Here is the code to produce the above output.

import numpy as np
import matplotlib.pyplot as plt

# Generate 10000 random samples from a standard normal distribution
x = np.random.normal(size=10000)

# Sort the data in ascending order
x_sorted = np.sort(x)

# Calculate the empirical cumulative proportion for each sorted value
y = np.arange(1, len(x_sorted) + 1) / len(x_sorted)

# Plot the ECDF
plt.plot(x_sorted, y)

# Add axis labels and a title
plt.xlabel('X')
plt.ylabel('ECDF')
plt.title('ECDF of Standard Normal Distribution')

# Display the plot
plt.show()

One example of using an ECDF plot is overlaying a normal curve on another dataset. For instance, we can create an ECDF plot for the number of home runs from the top 100 home run hitters inducted into the MLB Hall of Fame and overlay a normal curve that has the same mean and standard deviation as our home run data.

The following ECDF plot shows that the Hall of Fame MLB home run data does not fit a normal curve.

By examining the histogram below, it is apparent that our example data is heavily skewed; the brownish color marks where the two histograms overlap. However, if we had a dataset for which we were unsure whether it follows a normal distribution, we could use the ECDF described above to compare it to a normal distribution line.
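
If a visual comparison is not enough, one possible way to put a number on it is to measure the largest vertical gap between a dataset's ECDF and the CDF of a normal curve with the same mean and standard deviation. The sketch below is an illustration only: it uses synthetic samples rather than the home run data, and `max_ecdf_gap` is a helper name introduced here.

```python
import numpy as np
from statistics import NormalDist


def max_ecdf_gap(data):
    """Largest vertical gap between the ECDF and a normal CDF with matching mean and std."""
    x = np.sort(data)
    ecdf = np.arange(1, len(x) + 1) / len(x)
    fitted = NormalDist(mu=float(x.mean()), sigma=float(x.std()))
    normal_cdf = np.array([fitted.cdf(float(v)) for v in x])
    return float(np.max(np.abs(ecdf - normal_cdf)))


# Synthetic samples (seeded for repeatability): one normal, one heavily skewed
rng = np.random.default_rng(0)
gap_normal = max_ecdf_gap(rng.normal(size=5000))
gap_skewed = max_ecdf_gap(rng.exponential(size=5000))

print(f"normal sample gap: {gap_normal:.3f}")
print(f"skewed sample gap: {gap_skewed:.3f}")
```

A small gap suggests the ECDF hugs the normal curve, while a large gap points to a poor fit.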

Here is the code to produce the above output.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np


def ecdf(data):
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Top 100 Career Home Runs in the MLB Hall of Fame
hr_df = pd.DataFrame({'HR': [755,714,660,630,612,586,573,563,548,541,536,534,521,521,521,512,512,511,504,493,475,475,468,465,452,449,449,438,431,427,426,407,399,389,384,383,382,379,379,376,370,369,361,359,358,342,331,324,317,311,309,307,301,300,297,291,282,268,268,260,253,253,252,251,248,248,244,242,240,238,236,234,223,220,219,210,207,205,202,198,190,186,185,184,183,178,170,170,169,164,154,149,148,138,138,137,135,128,135,132]})

# Compute ECDF for the 'HR' column of hr_df
x_ecdf, y_ecdf = ecdf(hr_df['HR'])

# Create plot
fig, ax = plt.subplots()
ax.plot(x_ecdf, y_ecdf, label='HR')

# Compute the mean and standard deviation of the home run data, then draw 100,000 values from a normal distribution with those parameters
hr_array = hr_df['HR'].values
hr_mean = hr_array.mean()
hr_std = hr_array.std()

x = np.random.normal(hr_mean, hr_std, 100000)

# Compute ECDF
x_ecdf2, y_ecdf2 = ecdf(x)

# Overlay the ECDF of the normal sample and label the plot
ax.plot(x_ecdf2, y_ecdf2, label='Normal Curve')
ax.set_xlabel('HR')
ax.set_ylabel('Proportion of Total')
ax.set_title('ECDF plot')
ax.legend()
plt.savefig('standard-normal-curve-comparing-homeruns.png', dpi=300, bbox_inches='tight')
plt.show()

##############################################################################

# Plot histogram of 'HR' column
plt.hist(hr_df['HR'], bins=20, density=True, color='b', alpha=0.5, label='HR')
plt.hist(x, bins=20, density=True, color='orange', alpha=0.5, label='Normal Curve')

# Label the plot
plt.xlabel('HR')
plt.ylabel('Density')
plt.title('Histogram of HR vs a Normal Curve')
plt.legend()
plt.savefig('standard-normal-curve-homeruns-histogram.png', dpi=300, bbox_inches='tight')
plt.show()

Standardization is a statistical technique for putting different variables on the same scale. It is useful when you want to compare scores between different types of variables. To standardize a variable, you first calculate its mean and standard deviation. Then, for each observed value of the variable, you subtract the mean and divide the result by the standard deviation. After standardization, each data point is expressed in terms of the number of standard deviations above or below the mean.

For instance, if a data point has a standardized value of 2, it lies two standard deviations above the mean. This standardization process makes it easier to compare data across different variables, even if they have different units of measurement or scales. It provides a way to understand how an individual data point relates to the overall distribution of the data, regardless of the original measurement scale.

A real-world example of standardization is comparing individuals' heights and weights. Without standardization, it would be difficult to compare the two variables on the same scale since they have different units of measurement. However, by standardizing both variables, you can express each person's height and weight as a number of standard deviations from the group mean and compare them directly.

Below, we create a normal distribution where the mean and standard deviation are both 2 (the blue histogram) and then convert it to the standard normal distribution (the orange histogram) where the mean is 0 and the standard deviation is 1.

Here is the code to produce the graph.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# generate a random dataset with a normal (but not standard) distribution
x = np.random.normal(2, 2, 100000)

# plot the histogram of the original dataset
plt.hist(x, bins=50, density=True, alpha=0.5)

# calculate the mean and standard deviation of the dataset
mean_x = np.mean(x)
std_x = np.std(x)

# standardize the dataset by subtracting the mean and dividing by the standard deviation
x_std = (x - mean_x) / std_x

# create a normal distribution object with the mean and standard deviation values
norm_dist = norm(mean_x, std_x)

# generate values for x from mean-3*std to mean+3*std with a step of 0.1
x_values = np.arange(mean_x-3*std_x, mean_x+3*std_x, 0.1)

# calculate the probability density function of the normal distribution at different values of x
y_values = norm_dist.pdf(x_values)

# plot the standardized dataset
plt.hist(x_std, bins=50, density=True, alpha=0.5)

# plot the normal curve
plt.plot(x_values, y_values)

# plot the standard normal curve
y_values_std = norm.pdf(x_values, loc=0, scale=1)
plt.plot(x_values, y_values_std)

# set the y-axis label
plt.ylabel('Probability Density')

# set the title of the plot (the mean and standard deviation shown are those of the original dataset)
plt.title('Original Data (Mean={:.2f}, StdDev={:.2f}) vs. Standardized Data'.format(mean_x, std_x))

# show the plot
plt.savefig('standard-normal-curve-standardized-dataset.png', dpi=300, bbox_inches='tight')
plt.show()

In conclusion, statistics plays a crucial role in analyzing and interpreting data across various fields. Understanding basic concepts like probability, random variables, and distributions is important for both practical and theoretical applications. Tools like histograms, ECDFs, and standardization help us visualize and compare data in meaningful ways, while concepts like the normal curve and standard normal curve allow us to make predictions and draw conclusions from data sets. Python and its libraries like NumPy, Pandas, and Matplotlib provide powerful tools for working with statistics, making it easier than ever to analyze and interpret data. By understanding these concepts and tools, we can gain valuable insights from data and make informed decisions based on evidence.

The NumPy documentation provides information on 35 different distributions, listed below. I recommend taking a look at them; they are described in the random sampling (numpy.random) section of the NumPy documentation.

`beta`(a, b[, size])	Draw samples from a Beta distribution.
`binomial`(n, p[, size])	Draw samples from a binomial distribution.
`chisquare`(df[, size])	Draw samples from a chi-square distribution.
`dirichlet`(alpha[, size])	Draw samples from the Dirichlet distribution.
`exponential`([scale, size])	Draw samples from an exponential distribution.
`f`(dfnum, dfden[, size])	Draw samples from an F distribution.
`gamma`(shape[, scale, size])	Draw samples from a Gamma distribution.
`geometric`(p[, size])	Draw samples from the geometric distribution.
`gumbel`([loc, scale, size])	Draw samples from a Gumbel distribution.
`hypergeometric`(ngood, nbad, nsample[, size])	Draw samples from a Hypergeometric distribution.
`laplace`([loc, scale, size])	Draw samples from the Laplace or double exponential distribution with specified location (or mean) and scale (decay).
`logistic`([loc, scale, size])	Draw samples from a logistic distribution.
`lognormal`([mean, sigma, size])	Draw samples from a log-normal distribution.
`logseries`(p[, size])	Draw samples from a logarithmic series distribution.
`multinomial`(n, pvals[, size])	Draw samples from a multinomial distribution.
`multivariate_normal`(mean, cov[, size, …)	Draw random samples from a multivariate normal distribution.
`negative_binomial`(n, p[, size])	Draw samples from a negative binomial distribution.
`noncentral_chisquare`(df, nonc[, size])	Draw samples from a noncentral chi-square distribution.
`noncentral_f`(dfnum, dfden, nonc[, size])	Draw samples from the noncentral F distribution.
`normal`([loc, scale, size])	Draw random samples from a normal (Gaussian) distribution.
`pareto`(a[, size])	Draw samples from a Pareto II or Lomax distribution with specified shape.
`poisson`([lam, size])	Draw samples from a Poisson distribution.
`power`(a[, size])	Draws samples in [0, 1] from a power distribution with positive exponent a – 1.
`rayleigh`([scale, size])	Draw samples from a Rayleigh distribution.
`standard_cauchy`([size])	Draw samples from a standard Cauchy distribution with mode = 0.
`standard_exponential`([size])	Draw samples from the standard exponential distribution.
`standard_gamma`(shape[, size])	Draw samples from a standard Gamma distribution.
`standard_normal`([size])	Draw samples from a standard Normal distribution (mean=0, stdev=1).
`standard_t`(df[, size])	Draw samples from a standard Student’s t distribution with df degrees of freedom.
`triangular`(left, mode, right[, size])	Draw samples from the triangular distribution over the interval `[left, right]`.
`uniform`([low, high, size])	Draw samples from a uniform distribution.
`vonmises`(mu, kappa[, size])	Draw samples from a von Mises distribution.
`wald`(mean, scale[, size])	Draw samples from a Wald, or inverse Gaussian, distribution.
`weibull`(a[, size])	Draw samples from a Weibull distribution.
`zipf`(a[, size])	Draw samples from a Zipf distribution.
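
As a short sketch of how a few of these functions can be used, the example below draws samples through NumPy's `default_rng` generator, which offers methods with the same names as the functions listed above. The parameter values, the seed, and the choice of four distributions are arbitrary.

```python
import numpy as np

# Seeded generator so the draws are repeatable
rng = np.random.default_rng(seed=42)

samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=10_000),
    "exponential": rng.exponential(scale=2.0, size=10_000),
    "poisson": rng.poisson(lam=3.0, size=10_000),
    "uniform": rng.uniform(low=0.0, high=1.0, size=10_000),
}

# Summarize each sample by its mean and standard deviation
for name, values in samples.items():
    print(f"{name:12s} mean={values.mean():.3f} std={values.std():.3f}")
```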