For this analysis we will simulate a high-low card game.
Given a standard randomly shuffled 52 card deck, the card game is played by displaying an initial card and a player makes a prediction if the next card is of higher or lower value. The next card is then displayed, and the player again makes a prediction if the next card is of higher or lower value. The game continues until all cards are displayed. If the probability of the next card being higher or lower is equal, the player must make a random prediction. At the end of a trial, tally the number of correct predictions. If two cards of the same value are adjacent to each other, the player automatically cannot choose the correct outcome, as the next card is neither higher nor lower in value.
It should be noted here what we mean by the term “prediction”. We are creating a computerized simulation where we will count the number of cards available against the current displayed card and make our prediction based upon known probabilities. In a situation where the player was human, the guesses would truly be less educated (unless of course you can count cards perfectly).
Now that we understand the rules, let’s get on to the analysis. Given 1 million trials, create a statistical analysis of the outcome. How likely is it to correctly guess all 51 predictions? Note, to correctly guess all 51 predictions, we will need some amount of luck that the shuffled deck does not contain cards of the same value adjacent to each other.
For this analysis, I used SQL to create the simulation. A language such as Python may be better suited for a simulation like this, but SQL works well once you understand how to make a numbers table and combine it with a sequence and a uniqueidentifier data type.
SQL is a verbose language, so don’t expect it to be nicely compact. But reviewing the script I have included below; you can see how I don’t use conditional statements or variable declarations. I simply solve the puzzle using set-based solutions, where the only loop needed is to create each of the 1 million trials.
Now that we have that out of the way, let’s first look at the summary statistics given 1 million trails.
Looking at the summary statistics we have a good indication that the data follows a normal distribution as the values for the mean, median and mode are very close to each other, and we are lucky enough to have at least one trial where we had a perfect run of 51 correct guesses.
Reviewing a histogram, we can visually determine the data fits a normal distribution. Because it fits a normal distribution, we know that 66% of the results will fit within 1 standard deviation from the mean, 95% will fit within 2 standard deviations of the mean, and 99.7% of the results will fit within 3 standard deviations of the mean.
When looking at a box plot, we can see there are several outliers, both large and small, where the z-score will be above 3 standard deviations.
Looking at the detail, we can see by the z-score how rare the occurrences of these outliers are. It is more rare to get a mere 24 predictions correct (4.92 standard deviations from the mean) than it is to get all 51 predictions correct (4.76 standard deviations from the mean).
Out of 1 million simulations, we were able to determine all 51 predictions correctly only once, so we can literally say this was a “one in a million” occurrence. We were able to achieve only 24 correct results twice.
When reviewing a table of areas under the normal curve, any z-score above or below 3.7 has only a .01% chance of occurrence, so we know these occurrences are quite rare. However, given I ran the simulation 1 million times, I would expect to see these occurrences given the law of large numbers.
Performing these types of simulations is a fun way to learn both coding techniques and statistical analysis. I encourage everyone to try this same approach and create a fun game where you can guess the results and then test via a simulation. When performing data analysis, histograms, box plots, cumulative distributions, and scatter plots are a great resource for creating a visual understanding of your data.
Here is the SQL code written in SQL Server’s T-SQL.
-- Drop any previously created objects with the same names DROP SEQUENCE IF EXISTS dbo.CardDeckSequence; DROP TABLE IF EXISTS #Numbers; DROP TABLE IF EXISTS #CardShuffle; DROP TABLE IF EXISTS #CardShuffleResults; -- Create a temporary table with an identity column and insert 52 NULL values CREATE TABLE #Numbers ( RowNumber INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY, MyValue INTEGER ); INSERT INTO #Numbers (MyValue) VALUES (NULL); -- Insert 52 NULL values using a TOP and sys.objects INSERT INTO #Numbers (MyValue) SELECT TOP 52 NULL FROM sys.objects; -- Create a sequence that will generate card numbers from 1 to 13 CREATE SEQUENCE dbo.CardDeckSequence AS TINYINT START WITH 1 INCREMENT BY 1 MINVALUE 1 MAXVALUE 13 CYCLE; -- Create temporary tables to hold the shuffled cards and the results of the simulations CREATE TABLE #CardShuffle ( RandomNumber UNIQUEIDENTIFIER DEFAULT NEWID(), CardNumber TINYINT ); CREATE TABLE #CardShuffleResults ( Iteration INTEGER IDENTITY(1,1) PRIMARY KEY, Result INTEGER NOT NULL ); -- Initialize a counter variable to control the number of simulations to run DECLARE @Counter1 INTEGER = 1; -- Loop through the desired number of simulations WHILE @Counter1 <= 1000 --Number of simulations to run BEGIN -- Clear out the temporary tables TRUNCATE TABLE #CardShuffle; DROP TABLE IF EXISTS #CardShuffle2; DROP TABLE IF EXISTS #CardShuffle3; DROP TABLE IF EXISTS #CardShuffle4; DROP TABLE IF EXISTS #CardShuffle5; -- Insert the card numbers into the #CardShuffle table, randomize the order using a uniqueidentifier, and assign a step number to each card INSERT INTO #CardShuffle (CardNumber) SELECT (NEXT VALUE FOR dbo.CardDeckSequence) FROM #Numbers; SELECT ROW_NUMBER() OVER (ORDER BY RandomNumber) AS StepNumber, * INTO #CardShuffle2 FROM #CardShuffle ORDER BY 1; -- Calculate the number of cards that are higher, lower, or the same as the previous card in the sequence SELECT DISTINCT a.StepNumber, a.CardNumber, SUM(CASE WHEN a.CardNumber > b.CardNumber THEN 1 ELSE 0 END) AS LowerCardCount, SUM(CASE WHEN a.CardNumber < b.CardNumber THEN 1 ELSE 0 END) AS HigherCardCount, SUM(CASE WHEN a.CardNumber = b.CardNumber THEN 1 ELSE 0 END) AS SameCardCount INTO #CardShuffle3 FROM #CardShuffle2 a LEFT OUTER JOIN #CardShuffle2 b ON a.StepNumber < b.StepNumber GROUP BY a.StepNumber, a.CardNumber ORDER BY 1; -- Make a prediction of whether the next card will be higher or lower based on the counts of higher and lower cards -- Compare the prediction to the actual outcome of the next card in the sequence SELECT StepNumber, CardNumber, HigherCardCount, LowerCardCount, SameCardCount, LEAD(CardNumber,1) OVER (ORDER BY StepNumber) AS NextCardNumber, (CASE WHEN HigherCardCount > LowerCardCount THEN 'Higher' WHEN LowerCardCount > HigherCardCount THEN 'Lower' WHEN LowerCardCount = HigherCardCount