On this page we are going to work through some problems from a Statistics 101 textbook.
Typically, I prefer to delve into the rationale and mechanisms behind the problems at hand. However, in this instance, we’re going to bypass the theoretical exploration and head straight into addressing the textbook problems through Python solutions.
Please note that in addressing these problems, I will employ manual computations rather than resorting to the scipy.stats library. As we’ll be solving these problems by hand, it is imperative that you know how to determine the critical value from the relevant tables, given the degrees of freedom. A good statistics book will have an appendix with the appropriate charts for reference.
We will be performing the following hypothesis tests:
1. Hypothesis Test Based on a Single Sample With σ Known
2. Hypothesis Test Based on a Single Sample With σ Unknown
3. Hypothesis Testing With Two Independent Samples
4. Hypothesis Testing With Two Matched Samples
5. Hypothesis Testing Using ANOVA
6. Chi-Square Test of Independence
Here is a quick review of some common symbols used in statistics.
- X̄ (pronounced as “X-bar”) represents the sample mean, which is the average value of a sample.
- σ (pronounced as “sigma”) represents the standard deviation of the population, which is a measure of the amount of variation or dispersion in a population of data.
- μ (pronounced as “mu”) represents the population mean, which is the average value of a population.
- s represents the standard deviation of the sample, which is a measure of the amount of variation or dispersion in a sample of data.
- n represents the sample size, which is the number of observations in a sample.
- df represents the degrees of freedom.
- α (pronounced as “alpha”) represents the significance level in hypothesis testing: the probability of rejecting the null hypothesis when it is in fact true, a mistake known as a Type I error.
- ME represents the margin of error.
- SEM (or SE) represents the standard error. When it is estimated it is sometimes labeled as ESEM.
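To make the symbols concrete, here is a minimal sketch that computes n, df, X̄, s, and the estimated standard error for one sample. The sample is borrowed from the fraternity data used later on this page, and the variable names are my own choice for illustration.

```python
import math
import statistics

# Sample borrowed from the fraternity data used later on this page
sample = [6, 3, 2, 4, 5, 6, 7, 5, 4, 5, 4, 8, 6, 7]

n = len(sample)                  # n: sample size
df = n - 1                       # df: degrees of freedom for a single sample
x_bar = statistics.mean(sample)  # X-bar: sample mean
s = statistics.stdev(sample)     # s: sample standard deviation (divides by n - 1)
esem = s / math.sqrt(n)          # ESEM: estimated standard error of the mean

print("n:", n)
print("df:", df)
print("Sample mean:", x_bar)
print("Sample standard deviation:", s)
print("Estimated standard error of the mean:", esem)
```

Note that statistics.stdev divides by n - 1, which is what makes it the sample standard deviation s rather than the population value σ.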
On to the statistics problems...
Hypothesis Test Based on a Single Sample With σ Known
The police department of a major city reports that the mean number of auto thefts per neighborhood per year is 6.88 with a standard deviation of 1.19. As the mayor of a suburban community just outside the major city, you’re curious as to how the auto theft rate in your community compares. You determine that the mean number of auto thefts per neighborhood per year for a random sample of 15 neighborhoods in your community is 8.13. Assume that you’re working at the .05 level of significance.
a. State an appropriate null hypothesis.
b. What is the value of the calculated test statistic?
c. State your conclusion.
The null hypothesis is H0: μ = 6.88
The calculated z statistic is: 4.068259817444769
Reject the null hypothesis at the 0.05 significance level.
import math

# Population parameters
population_mean = 6.88
population_stdev = 1.19

# Sample
sample_mean = 8.13
n = 15

# Significance level and two-tailed critical value from the z table
significance_level = .05
critical_value = 1.96

# Difference between the sample mean and the population mean
diff_means = sample_mean - population_mean

# Standard error of the mean
sem = population_stdev / math.sqrt(n)

# z statistic
z_statistic = diff_means / sem

# a. State an appropriate null hypothesis.
print("The null hypothesis is H0: μ =", population_mean)

# b. What is the value of the calculated test statistic (z)?
print("The calculated z statistic is:", z_statistic)

# c. State your conclusion.
if abs(z_statistic) < critical_value:
    print("Fail to reject the null hypothesis at the", significance_level, "significance level.")
else:
    print("Reject the null hypothesis at the", significance_level, "significance level.")
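The critical value of 1.96 comes from a z table. As a cross-check only (this steps outside the table-lookup approach used on this page), the standard library's statistics.NormalDist can reproduce it without scipy. The two-tailed p-value is not part of the textbook question; it is included here as a sanity check under the same assumptions.

```python
import math
from statistics import NormalDist

# Values taken from the auto theft problem
population_mean = 6.88
population_stdev = 1.19
sample_mean = 8.13
n = 15
significance_level = 0.05

# Two-tailed critical value: the z score that leaves alpha / 2 in each tail
z_critical = NormalDist().inv_cdf(1 - significance_level / 2)

# Same z statistic as in the manual solution
z_statistic = (sample_mean - population_mean) / (population_stdev / math.sqrt(n))

# Two-tailed p-value: probability of a z at least this extreme if H0 is true
p_value = 2 * (1 - NormalDist().cdf(abs(z_statistic)))

print("Critical value (rounded):", round(z_critical, 2))
print("p-value below alpha:", p_value < significance_level)
```

Comparing |z| with the critical value and comparing the p-value with α are two views of the same decision, so both should agree.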
Hypothesis Test Based on a Single Sample With σ Unknown
The mean absenteeism rate for the local school district is reported as 8.45 days per year, per student. The mean rate for a sample of 30 students enrolled in a vocational training program is reported as 6.79 days per year with a standard deviation of 2.56 days. Assume that you’re working at the .05 level of significance.
a. State an appropriate null hypothesis.
b. What is the value of the calculated test statistic?
c. Identify the critical value.
d. State your conclusion.
The null hypothesis is H0: μ = 8.45
The calculated t statistic is: -3.55163845882256
The critical value is: 2.045
Reject the null hypothesis at the 0.05 significance level.
import math

# Population
population_mean = 8.45

# Sample
sample_mean = 6.79
n = 30
sample_sd = 2.56

# Significance level and two-tailed critical value from the t table (df = n - 1 = 29)
critical_value = 2.045
significance_level = .05

# Difference between the sample mean and the population mean
diff_means = sample_mean - population_mean

# Estimated standard error of the mean
esem = sample_sd / math.sqrt(n)

# t statistic
t_statistic = diff_means / esem

# a. State an appropriate null hypothesis.
print("The null hypothesis is H0: μ =", population_mean)

# b. What is the value of the calculated test statistic (t)?
print("The calculated t statistic is:", t_statistic)

# c. Identify the critical value.
print("The critical value is:", critical_value)

# d. State your conclusion.
if abs(t_statistic) >= critical_value:
    print("Reject the null hypothesis at the", significance_level, "significance level.")
else:
    print("Fail to reject the null hypothesis at the", significance_level, "significance level.")
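The symbol list mentions the margin of error (ME), which this problem does not ask for. As a rough sketch under the same numbers, ME is the critical value times the estimated standard error, and X̄ ± ME gives a 95% confidence interval. If the reported district mean falls outside that interval, the two-tailed test rejects H0, so this is another way to look at the same conclusion.

```python
import math

# Values taken from the absenteeism problem
population_mean = 8.45
sample_mean = 6.79
sample_sd = 2.56
n = 30
critical_value = 2.045  # t table, df = 29, two-tailed, alpha = .05

# Estimated standard error and margin of error
esem = sample_sd / math.sqrt(n)
margin_of_error = critical_value * esem

# 95% confidence interval around the sample mean
lower = sample_mean - margin_of_error
upper = sample_mean + margin_of_error

# True when the hypothesized mean lies outside the interval
outside_interval = not (lower <= population_mean <= upper)

print("Margin of error:", margin_of_error)
print("95% confidence interval:", (lower, upper))
print("Hypothesized mean outside the interval:", outside_interval)
```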
Hypothesis Testing With Two Independent Samples
Consider a research situation investigating the potential statistical difference in drinking habits between fraternity members and non-fraternity members. Assume you are working with a .05 significance level.
Fraternity members’ drinks per week: 6, 3, 2, 4, 5, 6, 7, 5, 4, 5, 4, 8, 6, 7
Non-fraternity members’ drinks per week: 0, 5, 3, 4, 3, 6, 3, 6, 5, 4, 4, 2
a. Formulate an appropriate null hypothesis.
b. Calculate the t statistic.
c. Identify the critical value.
d. State your conclusion.
The null hypothesis is: H0: μ1 = μ2
The calculated t statistic is: 2.1039711719961014
The critical value is: 2.06
Reject the null hypothesis at the 0.05 significance level.
import numpy as np
import math

# Drinks per week for fraternity and non-fraternity members
fraternity = np.array([6, 3, 2, 4, 5, 6, 7, 5, 4, 5, 4, 8, 6, 7])
non_fraternity = np.array([0, 5, 3, 4, 3, 6, 3, 6, 5, 4, 4, 2])

# Sample sizes and degrees of freedom
fraternity_n = len(fraternity)  # 14
non_fraternity_n = len(non_fraternity)  # 12
fraternity_df = fraternity_n - 1
non_fraternity_df = non_fraternity_n - 1
total_df = fraternity_df + non_fraternity_df  # 24

# Significance level and two-tailed critical value from the t table (df = 24)
significance_level = .05
critical_value = 2.06

# Sample means
fraternity_mean = np.mean(fraternity)
non_fraternity_mean = np.mean(non_fraternity)

# Sample variances (ddof=1 divides by n - 1)
fraternity_variance = np.var(fraternity, ddof=1)
non_fraternity_variance = np.var(non_fraternity, ddof=1)

# Pieces of the standard error of the difference of means:
# p1 / p2 is the pooled variance, p3 is the sample-size term
standard_error_diff_means_p1 = (fraternity_df * fraternity_variance) + (non_fraternity_df * non_fraternity_variance)
standard_error_diff_means_p2 = total_df
standard_error_diff_means_p3 = (1 / fraternity_n) + (1 / non_fraternity_n)

# Standard error of the difference between means
standard_error_diff_means = math.sqrt((standard_error_diff_means_p1 / standard_error_diff_means_p2) * standard_error_diff_means_p3)

# t statistic
t_statistic = (fraternity_mean - non_fraternity_mean) / standard_error_diff_means

# a. Formulate an appropriate null hypothesis.
print("The null hypothesis is: H0: μ1 = μ2")

# b. Calculate the t statistic.
print("The calculated t statistic is:", t_statistic)

# c. Identify the critical value.
print("The critical value is:", critical_value)

# d. State your conclusion.
if abs(t_statistic) > critical_value:
    print("Reject the null hypothesis at the", significance_level, "significance level.")
else:
    print("Fail to reject the null hypothesis at the", significance_level, "significance level.")
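The pooled-variance calculation can also be wrapped in a small reusable function. This is only a sketch: the function name pooled_t is made up for illustration, and it assumes, as the manual solution does, that both groups share a common population variance.

```python
import math
import statistics


def pooled_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) for a pooled-variance t test."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = (n_a - 1) + (n_b - 1)
    # Pooled variance: each sample variance weighted by its degrees of freedom
    pooled_variance = ((n_a - 1) * statistics.variance(sample_a)
                       + (n_b - 1) * statistics.variance(sample_b)) / df
    # Standard error of the difference between the two sample means
    standard_error = math.sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / standard_error
    return t, df


fraternity = [6, 3, 2, 4, 5, 6, 7, 5, 4, 5, 4, 8, 6, 7]
non_fraternity = [0, 5, 3, 4, 3, 6, 3, 6, 5, 4, 4, 2]

t_statistic, total_df = pooled_t(fraternity, non_fraternity)
print("t statistic:", t_statistic)
print("Degrees of freedom:", total_df)
```

The degrees of freedom returned here are what you take to the t table to look up the critical value.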
Hypothesis Testing With Two Matched Samples
Consider the set of scores which reflect the performance of a drug awareness test administered to a sample of 15 participants in a before/after test situation.
First, each of 15 participants was administered a drug awareness test, and their scores were recorded. The participants were then shown a film concerning the dangers of recreational drug use. Following exposure to the film, the 15 participants were given the drug awareness test again, and these scores were recorded.
Before: 50, 77, 67, 94, 64, 77, 85, 52, 81, 91, 52, 61, 83, 66, 71
After: 55, 79, 82, 90, 64, 83, 80, 55, 79, 91, 61, 77, 83, 70, 75
Remember, these are matched samples, so each position in the two lists corresponds to the same participant.
The null hypothesis is H0: μ1 = μ2
The calculated t statistic is: 2.2331171200871616
The critical value is: 2.15
Reject the null hypothesis at the 0.05 significance level.
import numpy as np
import math

# Define the two datasets (same participant at the same position in each list)
data_before = [50, 77, 67, 94, 64, 77, 85, 52, 81, 91, 52, 61, 83, 66, 71]
data_after = [55, 79, 82, 90, 64, 83, 80, 55, 79, 91, 61, 77, 83, 70, 75]
n = len(data_before)  # 15 matched pairs

# Significance level and two-tailed critical value from the t table (df = n - 1 = 14)
significance_level = 0.05
critical_value = 2.15

# Differences between before and after scores
data_diff = np.array(data_before) - np.array(data_after)

# Mean of the differences (absolute value, since the test is two-tailed)
mean_of_differences = abs(np.mean(data_diff))

# Standard deviation of the differences (divides by n - 1)
standard_deviation_of_differences = np.sqrt(np.sum((data_diff - np.mean(data_diff))**2) / (len(data_diff) - 1))

# Estimated standard error of the mean difference
estimated_standard_error_of_mean_differences = standard_deviation_of_differences / math.sqrt(n)

# t statistic
t_statistic = mean_of_differences / estimated_standard_error_of_mean_differences

# a. State an appropriate null hypothesis.
print("The null hypothesis is H0: μ1 = μ2")

# b. What is the value of the calculated test statistic (t)?
print("The calculated t statistic is:", t_statistic)

# c. Identify the critical value.
print("The critical value is:", critical_value)

# d. State your conclusion.
if abs(t_statistic) >= critical_value:
    print("Reject the null hypothesis at the", significance_level, "significance level.")
else:
    print("Fail to reject the null hypothesis at the", significance_level, "significance level.")
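A matched-samples test is a single-sample t test run on the per-participant differences. The sketch below makes that explicit using only the standard library. It subtracts before from after, which is my own choice of direction, so the sign of t shows the direction of the change without needing an absolute value.

```python
import math
import statistics

before = [50, 77, 67, 94, 64, 77, 85, 52, 81, 91, 52, 61, 83, 66, 71]
after = [55, 79, 82, 90, 64, 83, 80, 55, 79, 91, 61, 77, 83, 70, 75]

# One difference per participant (after minus before)
differences = [a - b for a, b in zip(after, before)]

n = len(differences)
mean_diff = statistics.mean(differences)  # average change in score
sd_diff = statistics.stdev(differences)   # sample standard deviation of the changes
esem = sd_diff / math.sqrt(n)             # estimated standard error of the mean change

# Single-sample t statistic testing whether the mean change is zero
t_statistic = mean_diff / esem

print("Mean change:", mean_diff)
print("t statistic:", t_statistic)
print("Degrees of freedom:", n - 1)
```

A positive t here means scores went up on average after the film.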
Hypothesis Testing Using ANOVA
An evaluation survey, designed to measure perceived program effectiveness, was administered to a sample of 37 citizens who attended a community crime-prevention meeting. The respondents were asked to rate (on a scale of 1 to 12) the meeting in terms of effectiveness in presenting useful information. The responses were analyzed based upon the place of residence of the respondent (northern, southern, eastern, or western sector), and the following results were found.
Northern: 3.8, 7.1, 9.6, 8.4, 5.1, 11.6, 6.2, 7.9, 9.0, 10.3
Southern: 4.2, 6.5, 4.4, 8.1, 7.6, 5.8, 4.0, 7.3, 5.2, 4.8
Eastern: 8.8, 5.1, 12.7, 6.4, 9.8, 6.3, 10.2, 8.5, 11.9, 8.6
Western: 4.8, 1.2, 8.0, 9.4, 3.6, 8.7, 6.5
a. State an appropriate null hypothesis.
b. What are the values of each category mean?
c. What is the value of the grand mean?
d. What is the value of the between-groups sum of squares?
e. What is the value of the within-groups sum of squares?
f. What is the value of the between-groups degrees of freedom?
g. What is the value of the within-groups degrees of freedom?
h. What is the value of the within-groups mean of squares?
i. What is the value of the between-groups mean of squares?
j. What is the value of F?
k. Assuming you were working at the .05 level of significance, what would you conclude?
The null hypothesis is H0: μ1 = μ2 = μ3 = μ4
Mean for Northern sector: 7.9
Mean for Southern sector: 5.79
Mean for Eastern sector: 8.83
Mean for Western sector: 6.028571428571429
Grand Mean: 7.2270270270270265
Between-Groups Sum of Squares: 60.92868725868725
Within-Groups Sum of Squares: 179.3042857142857
Between-Groups Degrees of Freedom: 3
Within-Groups Degrees of Freedom: 33
Within-Groups Mean of Squares: 5.433463203463203
Between-Groups Mean of Squares: 20.309562419562415
F value: 3.7378669292574616
Critical value: 2.92
Reject the null hypothesis at the 0.05 significance level.
import numpy as np

# Define the data for each sector
northern = np.array([3.8, 7.1, 9.6, 8.4, 5.1, 11.6, 6.2, 7.9, 9.0, 10.3])
southern = np.array([4.2, 6.5, 4.4, 8.1, 7.6, 5.8, 4.0, 7.3, 5.2, 4.8])
eastern = np.array([8.8, 5.1, 12.7, 6.4, 9.8, 6.3, 10.2, 8.5, 11.9, 8.6])
western = np.array([4.8, 1.2, 8.0, 9.4, 3.6, 8.7, 6.5])

# Significance level and critical value from the F table
significance_level = .05
critical_value = 2.92

# Calculate the mean for each sector
mean_northern = np.mean(northern)
mean_southern = np.mean(southern)
mean_eastern = np.mean(eastern)
mean_western = np.mean(western)

# Calculate the grand mean
grand_mean = np.mean(np.concatenate((northern, southern, eastern, western)))

# Calculate the between-groups sum of squares
ss_between = (len(northern) * (mean_northern - grand_mean)**2 +
              len(southern) * (mean_southern - grand_mean)**2 +
              len(eastern) * (mean_eastern - grand_mean)**2 +
              len(western) * (mean_western - grand_mean)**2)

# Calculate the within-groups sum of squares
ss_within = (np.sum((northern - mean_northern)**2) +
             np.sum((southern - mean_southern)**2) +
             np.sum((eastern - mean_eastern)**2) +
             np.sum((western - mean_western)**2))

# Calculate the between-groups degrees of freedom (number of groups - 1)
df_between = 4 - 1

# Calculate the within-groups degrees of freedom (total observations - number of groups)
df_within = len(northern) + len(southern) + len(eastern) + len(western) - 4

# Calculate the within-groups mean of squares
ms_within = ss_within / df_within

# Calculate the between-groups mean of squares
ms_between = ss_between / df_between

# Calculate the F statistic
f_statistic = ms_between / ms_within

# a. State the null hypothesis
print("The null hypothesis is H0: μ1 = μ2 = μ3 = μ4")

# b. Print the mean for each sector
print("Mean for Northern sector:", mean_northern)
print("Mean for Southern sector:", mean_southern)
print("Mean for Eastern sector:", mean_eastern)
print("Mean for Western sector:", mean_western)

# c. Print the grand mean
print("Grand Mean:", grand_mean)

# d. Print the between-groups sum of squares
print("Between-Groups Sum of Squares:", ss_between)

# e. Print the within-groups sum of squares
print("Within-Groups Sum of Squares:", ss_within)

# f. Print the between-groups degrees of freedom
print("Between-Groups Degrees of Freedom:", df_between)

# g. Print the within-groups degrees of freedom
print("Within-Groups Degrees of Freedom:", df_within)

# h. Print the within-groups mean of squares
print("Within-Groups Mean of Squares:", ms_within)

# i. Print the between-groups mean of squares
print("Between-Groups Mean of Squares:", ms_between)

# j. Print the F value
print("F value:", f_statistic)

# k. Compare the F value with the critical value
print("Critical value:", critical_value)
if f_statistic > critical_value:
    print("Reject the null hypothesis at the", significance_level, "significance level.")
else:
    print("Fail to reject the null hypothesis at the", significance_level, "significance level.")
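A handy check on a manual ANOVA is that the total sum of squares splits exactly into the between-groups and within-groups parts. The sketch below verifies that identity for this data and writes the calculation as a loop over groups, so it does not depend on there being exactly four sectors. The dictionary layout is my own choice for illustration.

```python
import numpy as np

groups = {
    "Northern": np.array([3.8, 7.1, 9.6, 8.4, 5.1, 11.6, 6.2, 7.9, 9.0, 10.3]),
    "Southern": np.array([4.2, 6.5, 4.4, 8.1, 7.6, 5.8, 4.0, 7.3, 5.2, 4.8]),
    "Eastern": np.array([8.8, 5.1, 12.7, 6.4, 9.8, 6.3, 10.2, 8.5, 11.9, 8.6]),
    "Western": np.array([4.8, 1.2, 8.0, 9.4, 3.6, 8.7, 6.5]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()

# Total variation of every score around the grand mean
ss_total = np.sum((all_scores - grand_mean) ** 2)

# Variation of the group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())

# Variation of the scores around their own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups.values())

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", ss_total)
print("SS between + SS within:", ss_between + ss_within)
print("F value:", f_statistic)
```

If the two sums of squares printed above disagree by more than rounding error, something went wrong in one of the manual steps.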
Chi-Square Test of Independence
You are interested in whether there is any association between gender and academic major. Questioning 75 students, you obtain the following results:
Gender | Business | Science | Liberal Arts | Other | Total |
Female | 10 | 9 | 9 | 7 | 35 |
Male | 12 | 11 | 10 | 7 | 40 |
Total | 22 | 20 | 19 | 14 | 75 |
a. How many degrees of freedom are involved?
b. What is the calculated value of Chi-Square?
c. Assuming the .05 level of significance, what would you conclude?
Degrees of Freedom: 3
Chi-Square: 0.10156784005468232
The null hypothesis is not rejected. There is no significant association between gender and major.
import numpy as np

# Observed data (rows: Female, Male; columns: Business, Science, Liberal Arts, Other)
observed = np.array([[10, 9, 9, 7], [12, 11, 10, 7]])

# Calculate row, column and total sums
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
grand_total = observed.sum()

# Expected counts: (row total * column total) / grand total
expected = np.outer(row_totals, col_totals) / grand_total

# a. Degrees of freedom: (rows - 1) * (columns - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(f'Degrees of Freedom: {df}')

# b. Chi-square statistic
chi_square = ((observed - expected)**2 / expected).sum()
print(f'Chi-Square: {chi_square}')

# c. Conclusion (critical value from the chi-square table, df = 3, alpha = .05)
critical_value = 7.815
if chi_square > critical_value:
    print('The null hypothesis is rejected. There is a significant association between gender and major.')
else:
    print('The null hypothesis is not rejected. There is no significant association between gender and major.')
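As an optional cross-check on the table lookup, the upper-tail probability of a chi-square variable has a closed form when there are exactly 3 degrees of freedom, so it can be computed with the standard library alone. This is a sketch for this df only: the helper name chi_square_sf_df3 is made up, and the formula does not carry over to other degrees of freedom.

```python
import math


def chi_square_sf_df3(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


chi_square = 0.10156784005468232  # calculated statistic from the problem
critical_value = 7.815            # chi-square table, df = 3, alpha = .05

# p-value for the calculated statistic
p_value = chi_square_sf_df3(chi_square)

# Tail area beyond the table's critical value; should be close to alpha
tail_at_critical = chi_square_sf_df3(critical_value)

print("p-value:", p_value)
print("Tail area at the critical value:", tail_at_critical)
```

A p-value far above .05 agrees with the decision not to reject the null hypothesis.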