Iris dataset

Ah, the Iris dataset. A rite of passage for anybody learning exploratory data analysis and machine learning.

Much has been written about the iris dataset, so I won’t go into much detail about it. Let’s get on with the show…


Exploratory Data Analysis

The first step when reviewing a dataset is to create a pair plot to 1) review the relationships between pairs of variables, and 2) review the distribution of each individual variable.

From the pair plot we can see how well the species separate on petal and sepal length and width. This is evident from each color forming its own tight group in most of the scatter plots.

Also, the feature distributions are largely multimodal: each species has a distinct mean sepal length, petal length, and petal width. The exception is sepal width, where all three species overlap around a common range of values.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

# load the dataset
df_iris = sns.load_dataset('iris')

sns.pairplot(df_iris, hue="species")
plt.savefig('iris-dataset-pair-plot.png', dpi=300, bbox_inches='tight')
plt.show()
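
To put numbers behind the distinct per-species means mentioned above, a quick groupby (standard pandas, not part of the original walkthrough) quantifies what the pair plot shows.

# mean of each measurement, grouped by species
print(df_iris.groupby('species')[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].mean())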

Another good plot for exploratory data analysis is the box plot.

Here we can see that sepal width has several outliers, mostly because its interquartile range (IQR) is quite narrow compared to the other features.

# box plots of the four numeric features
df_iris_boxplot = df_iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]]

sns.boxplot(data=df_iris_boxplot) 
plt.savefig('iris-dataset-boxplot.png', dpi=300, bbox_inches='tight')
plt.show()

Concentrating on sepal width and its outliers, we can compute the following summary statistics.

First, we see the data is approximately normally distributed, as the pair plot already suggested, with a standard deviation of about 0.43 cm (the iris measurements are in centimeters).

# extract sepal width as a 1-D NumPy array (a 1-D array keeps the
# boolean masks below usable for row selection)
np_iris_sepal_width = df_iris['sepal_width'].values

# note: distplot is deprecated in newer seaborn releases; use histplot(..., kde=True) there
sns.distplot(np_iris_sepal_width)
plt.text(0.05, 0.95, 'Min sepal width: {:.2f}'.format(np.min(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.text(0.05, 0.9, 'Max sepal width: {:.2f}'.format(np.max(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.text(0.05, 0.85, 'Avg sepal width: {:.2f}'.format(np.average(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.text(0.05, 0.8, 'Std of sepal width: {:.2f}'.format(np.std(np_iris_sepal_width)), transform=plt.gca().transAxes)
plt.savefig('iris-dataset-distplot.png', dpi=300, bbox_inches='tight')
plt.show()
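
To sanity-check the normality claim, we can run a quick statistical test. Here is a minimal sketch using scipy.stats.normaltest (my addition, not part of the original walkthrough); a large p-value means we cannot reject the hypothesis that the sample came from a normal distribution.

from scipy import stats

# D'Agostino-Pearson test: the null hypothesis is that the sample
# was drawn from a normal distribution
stat, p_value = stats.normaltest(np_iris_sepal_width)
print('Normality test statistic: {0:.3f}, p-value: {1:.3f}'.format(stat, p_value))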

And here is some nifty code to review which records are causing the outliers.

q1 = df_iris['sepal_width'].quantile(0.25)
q3 = df_iris['sepal_width'].quantile(0.75)
iqr = q3 - q1
print('The 25th percentile is ' + '{0:.3}'.format(q1))
print('The 50th percentile is ' + '{0:.3}'.format(df_iris['sepal_width'].quantile(0.5)))
print('The 75th percentile is ' + '{0:.3}'.format(q3))
print('The interquartile range is ' + '{0:.3}'.format(iqr))
lower_whisker = q1 - (1.5 * iqr)
upper_whisker = q3 + (1.5 * iqr)
print('The lower and upper whiskers of the box plot sit at ' + '{0:.3}'.format(lower_whisker) + ' and ' + '{0:.3}'.format(upper_whisker))
print('###############################################################')
print('The lower outliers are:')
print('---------------------------------------------------------------')
# boolean mask (renamed to avoid shadowing the built-in filter())
outlier_mask = np_iris_sepal_width < lower_whisker
print(df_iris[outlier_mask])
print('###############################################################')
print('The upper outliers are:')
print('---------------------------------------------------------------')
outlier_mask = np_iris_sepal_width > upper_whisker
print(df_iris[outlier_mask])
print('###############################################################')

Supervised Machine Learning

Now that we know the data separates well by species, let’s create training and test sets.

X_iris = df_iris.drop('species',axis=1)
#print(X_iris.shape)

y_iris = df_iris['species']

# train_test_split moved from sklearn.cross_validation to
# sklearn.model_selection in scikit-learn 0.18
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)

Gaussian Naive Bayes is a good model to start with, as it’s fast and has essentially no hyperparameters to tune. It is often used first to establish a baseline accuracy before deciding whether a more complex model is worth the effort.

#choose model class
from sklearn.naive_bayes import GaussianNB

# instantiate the model
model = GaussianNB()

# fit to the training data, then predict labels for the held-out test set
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

Reviewing the accuracy, we get roughly 97%. Not bad for a simple yet powerful ML model!

from sklearn.metrics import accuracy_score
print('Accuracy score is ' + '{0:.2%}'.format(accuracy_score(ytest, y_model)))
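
A single train/test split can be a noisy estimate. As a sanity check (my addition, with an assumed cv=5), scikit-learn's cross_val_score gives a sturdier picture of the baseline:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation over the full dataset
scores = cross_val_score(GaussianNB(), X_iris, y_iris, cv=5)
print('Cross-validated accuracy: {0:.2%} (+/- {1:.2%})'.format(scores.mean(), scores.std()))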

Unsupervised Machine Learning

Now let’s perform an unsupervised machine learning process on the iris dataset.

Because the pair plot showed good separation between the species, we can expect that principal component analysis (PCA) will be able to reduce the number of dimensions while preserving those distinctions.

Often this dimensionality reduction is used as a visualization aid, since it is much easier to plot data in two dimensions than in the original four dimensions of the iris dataset.

from sklearn.decomposition import PCA

# project the four features down to the two principal components
model = PCA(n_components=2)
model.fit(X_iris)
X_2D = model.transform(X_iris)
df_iris['PCA1'] = X_2D[:, 0]
df_iris['PCA2'] = X_2D[:, 1]
 
sns.lmplot(x="PCA1", y="PCA2", hue='species', data=df_iris, fit_reg=False)
plt.savefig('iris-dataset-pca.png', dpi=300, bbox_inches='tight')
plt.show()

As expected, PCA reduced the data to two dimensions while keeping the species clearly separated.
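
We can also check how much information the two components retain via the fitted model's explained_variance_ratio_ attribute (standard scikit-learn PCA; on iris the first two components typically capture the vast majority of the variance):

# fraction of the total variance captured by each principal component
print('Explained variance ratio:', model.explained_variance_ratio_)
print('Total variance retained: {0:.2%}'.format(model.explained_variance_ratio_.sum()))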


Let’s perform one more unsupervised learning model on the iris dataset. This time we are going to use a Gaussian mixture model (GMM), an algorithm that attempts to find distinct groups in the data without using any labels.

from sklearn import mixture          # 1. Choose the model class
model = mixture.GaussianMixture(n_components=3, covariance_type='full')  # 2. Instantiate the model w/ hyperparameters
model.fit(X_iris)                    # 3. Fit to data. Notice y is not specified!
y_gmm = model.predict(X_iris)        # 4. Determine cluster labels

df_iris['cluster'] = y_gmm
sns.lmplot(x="PCA1", y="PCA2", data=df_iris, hue='species', col='cluster', fit_reg=False)
plt.savefig('iris-dataset-gaussian-mixture.png', dpi=300, bbox_inches='tight')
plt.show()
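
To see how well the unlabeled clusters line up with the true species, a simple cross-tabulation (standard pandas, my addition) does the trick:

# rows are GMM cluster labels, columns are the true species
print(pd.crosstab(df_iris['cluster'], df_iris['species']))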

Conclusion

The iris dataset makes for a good introductory dataset for exploratory data analysis and machine learning due to its distinct clusters. Because it is used so extensively in ML tutorials and examples, it is worth spending some time getting to know this dataset well.

Thanks for reading!