Iris dataset

Ah, the Iris dataset. A rite of passage for anybody learning exploratory data analysis and machine learning.

Much has been written about the iris dataset, so I won’t go into much detail about it. Let’s get on with the show…


Exploratory Data Analysis

The first step when reviewing a dataset is to create a pair plot to 1) review the relationships between pairs of variables, and 2) review the distribution of each individual variable.

From the pair plot we can see how well the species separate on petal and sepal length and width. This is evident from each color forming its own tight group in most of the scatter plots.

Also, the feature distributions are largely multimodal: each species has a distinct mean sepal length, petal length, and petal width. The exception is sepal width, where all three species overlap around a common range of values.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

# load the dataset
df_iris = sns.load_dataset('iris')

sns.pairplot(df_iris, hue="species")
plt.savefig('iris-dataset-pair-plot.png', dpi=300, bbox_inches='tight')
plt.show()
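
To put numbers behind the distinct per-species means mentioned above, a quick groupby (standard pandas, not part of the original walkthrough) quantifies what the pair plot shows.

# mean of each measurement, grouped by species
print(df_iris.groupby('species')[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].mean())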

Another good plot for exploratory data analysis is the box plot.

Here we can see that sepal width has several outliers, mostly because its interquartile range (IQR) is quite narrow compared to the other features.

# box plots of the four numeric features
df_iris_boxplot = df_iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]]

sns.boxplot(data=df_iris_boxplot) 
plt.savefig('iris-dataset-boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

Concentrating on sepal width and its outliers, we can compute the following summary statistics.

First, we see the data is approximately normally distributed, as the pair plot already suggested, with a standard deviation of about 0.43 cm (the iris measurements are in centimeters).

# extract sepal width as a 1-D NumPy array (a 1-D array keeps the
# boolean masks below usable for row selection)
np_iris_sepal_width = df_iris['sepal_width'].values

# note: distplot is deprecated in newer seaborn releases; use histplot(..., kde=True) there
sns.distplot(np_iris_sepal_width)
plt.text(0.05, 0.95, 'Min sepal width: {:.2f}'.format(np.min(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.text(0.05, 0.9, 'Max sepal width: {:.2f}'.format(np.max(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.text(0.05, 0.85, 'Avg sepal width: {:.2f}'.format(np.average(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.text(0.05, 0.8, 'Std of sepal width: {:.2f}'.format(np.std(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.savefig('iris-dataset-distplot.png', dpi=300, bbox_inches='tight')
plt.show()
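
To sanity-check the normality claim, we can run a quick statistical test. Here is a minimal sketch using scipy.stats.normaltest (my addition, not part of the original walkthrough); a large p-value means we cannot reject the hypothesis that the sample came from a normal distribution.

from scipy import stats

# D'Agostino-Pearson test: the null hypothesis is that the sample
# was drawn from a normal distribution
stat, p_value = stats.normaltest(np_iris_sepal_width)
print('Normality test statistic: {0:.3f}, p-value: {1:.3f}'.format(stat, p_value))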

And here is some nifty code to review which records are causing the outliers.

q1 = df_iris['sepal_width'].quantile(0.25)
q3 = df_iris['sepal_width'].quantile(0.75)
iqr = q3 - q1
print('The 25th percentile is ' + '{0:.3}'.format(q1))
print('The 50th percentile is ' + '{0:.3}'.format(df_iris['sepal_width'].quantile(0.5)))
print('The 75th percentile is ' + '{0:.3}'.format(q3))
print('The interquartile range is ' + '{0:.3}'.format(iqr))
lower_whisker = q1 - (1.5 * iqr)
upper_whisker = q3 + (1.5 * iqr)
print('The lower and upper whiskers of the box plot sit at ' + '{0:.3}'.format(lower_whisker) + ' and ' + '{0:.3}'.format(upper_whisker))
print('###############################################################')
print('The lower outliers are:')
print('---------------------------------------------------------------')
# boolean mask (renamed to avoid shadowing the built-in filter())
outlier_mask = np_iris_sepal_width < lower_whisker
print(df_iris[outlier_mask])
print('###############################################################')
print('The upper outliers are:')
print('---------------------------------------------------------------')
outlier_mask = np_iris_sepal_width > upper_whisker
print(df_iris[outlier_mask])
print('###############################################################')

Supervised Machine Learning

Now that we know the data separates well by species, let’s create training and test sets.

X_iris = df_iris.drop('species',axis=1)
#print(X_iris.shape)

y_iris = df_iris['species']

# train_test_split moved from sklearn.cross_validation to
# sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)

Gaussian Naive Bayes is a good model to start with, as it’s fast and has essentially no hyperparameters to tune. It is often used first to establish a baseline accuracy before deciding whether a more complex model is worth the effort.

#choose model class
from sklearn.naive_bayes import GaussianNB

# instantiate the model
model = GaussianNB()

# fit to the training data, then predict labels for the held-out test set
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

Reviewing the accuracy, we get roughly 97%. Not bad for a simple yet powerful ML model!

from sklearn.metrics import accuracy_score
print('Accuracy score is ' + '{0:.2%}'.format(accuracy_score(ytest, y_model)))
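
A single train/test split can be a noisy estimate. As a sanity check (my addition, with an assumed cv=5), scikit-learn's cross_val_score gives a sturdier picture of the baseline:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation over the full dataset
scores = cross_val_score(GaussianNB(), X_iris, y_iris, cv=5)
print('Cross-validated accuracy: {0:.2%} (+/- {1:.2%})'.format(scores.mean(), scores.std()))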

Unsupervised Machine Learning

Now let’s perform an unsupervised machine learning process on the iris dataset.

Because the pair plot showed good separation between the species, we can expect that principal component analysis (PCA) will be able to reduce the number of dimensions while preserving those distinctions.

Often this dimensionality reduction is used as a visualization aid, since it is much easier to plot data in two dimensions than in the original four dimensions of the iris dataset.

from sklearn.decomposition import PCA

# project the four features down to the two principal components
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)
df_iris['PCA1'] = X_2D[:, 0]
df_iris['PCA2'] = X_2D[:, 1]
 
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=df_iris, fit_reg=False)
plt.savefig('iris-dataset-pca.png', dpi=300, bbox_inches='tight')
plt.show()

As expected, PCA reduced the data to two dimensions while keeping the species clearly separated.
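
We can also check how much information the two components retain via the fitted model's explained_variance_ratio_ attribute (standard scikit-learn PCA; on iris the first two components typically capture the vast majority of the variance):

# fraction of the total variance captured by each principal component
print('Explained variance ratio:', model.explained_variance_ratio_)
print('Total variance retained: {0:.2%}'.format(model.explained_variance_ratio_.sum()))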


Let’s perform one more unsupervised learning model on the iris dataset. This time we are going to use a Gaussian mixture model (GMM), an algorithm that attempts to find distinct groups in the data without using any labels.

from sklearn import mixture          # 1. Choose the model class
model = mixture.GaussianMixture(n_components=3, covariance_type='full')  # 2. Instantiate the model w/ hyperparameters
model.fit(X_iris)                    # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris)        # 4. Determine cluster labels

df_iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=df_iris, hue='species', col='cluster', fit_reg=False)
plt.savefig('iris-dataset-gaussian-mixture.png', dpi=300, bbox_inches='tight')
plt.show()
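
To see how well the unlabeled clusters line up with the true species, a simple cross-tabulation (standard pandas, my addition) does the trick:

# rows are GMM cluster labels, columns are the true species
print(pd.crosstab(df_iris['cluster'], df_iris['species']))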

Conclusion

The iris dataset makes for a good introductory dataset for exploratory data analysis and machine learning due to its distinct clusters. Because it is used so extensively in ML tutorials and examples, it is worth spending some time getting to know this dataset well.

Thanks for reading!