Using CNTK and Python to learn from Iris data

In this blog post we are going to implement training and evaluation of an ANN model based on the Iris data set, using CNTK and Python. The Iris data set has a categorical output value which contains three classes: Setosa, Versicolor and Virginica. The features consist of 4 real-valued inputs. The Iris data set can easily be found on the internet; one of the places is http://kaggle.com

Usually, the Iris data is given in the following format:
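
For illustration, a few representative rows in the common CSV form (the exact column order may vary between copies of the data set):

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica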

Since we are going to use CNTK, we should prepare the data in the CNTK text file format, which is quite different from the CSV format above. The CNTK format has a different structure and looks like the following:
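
For example, the same three samples in the CNTK text format, assuming Setosa, Versicolor and Virginica map to the three one-hot positions in that order:

|labels 1 0 0 |features 5.1 3.5 1.4 0.2
|labels 0 1 0 |features 7.0 3.2 4.7 1.4
|labels 0 0 1 |features 6.3 3.3 6.0 2.5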

The difference is obvious. Transforming the previous file format into the CNTK format took me several minutes, and now we can continue with the implementation.
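
A minimal conversion sketch is shown below. It assumes the CSV layout shown earlier; the file names and the species-to-one-hot mapping are illustrative assumptions, not part of the original post.

# Sketch: convert Iris CSV rows into the CNTK text format (CTF).
# The file names and the one-hot mapping below are assumptions.
species_to_onehot = {
    "Iris-setosa": "1 0 0",
    "Iris-versicolor": "0 1 0",
    "Iris-virginica": "0 0 1",
}

with open("iris.csv") as src, open("trainData_cntk.txt", "w") as dst:
    for line in src:
        line = line.strip()
        if not line or line.startswith("sepal"):  # skip blanks and the header row
            continue
        *features, species = line.split(",")
        dst.write("|labels {0} |features {1}\n".format(
            species_to_onehot[species], " ".join(features)))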

First, let's implement a simple Python function to read the CNTK format. For the implementation we are going to use the CNTK MinibatchSource, which is specially developed to handle file data. The following Python code reads the file and returns the MinibatchSource.

import cntk

# The data in the file must satisfy the following format:
# |labels 0 0 1 |features 2.1 7.0 2.2 1.5 - the format consists of 4 features and
# one 3-component one-hot vector which represents the iris flower class
def create_reader(path, is_training, input_dim, num_label_classes):

    # create the streams separately for the labels and for the features
    labelStream = cntk.io.StreamDef(field='labels', shape=num_label_classes, is_sparse=False)
    featureStream = cntk.io.StreamDef(field='features', shape=input_dim, is_sparse=False)

    # create the deserializer by providing the file path and the related streams
    deserializer = cntk.io.CTFDeserializer(path, cntk.io.StreamDefs(labels=labelStream, features=featureStream))

    # create the minibatch source as the function's return value
    mb = cntk.io.MinibatchSource(deserializer, randomize=is_training,
                                 max_sweeps=cntk.io.INFINITELY_REPEAT if is_training else 1)
    return mb

The function above takes several arguments:

- path - the file path where the data is stored,

- is_training - a Boolean which indicates whether the data is used for training or testing; in the case of training the data is randomized,

- input_dim, num_label_classes - the number of input features and the size of the output one-hot vector. These two arguments are needed in order to parse the file properly.

The function first creates the two streams, which are passed as arguments when creating the deserializer, which in turn is used to create the MinibatchSource. The function returns a MinibatchSource object, which the trainer uses for data handling.
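
As a quick sanity check (a sketch, assuming the converted training file from above exists), the reader can be created and queried directly; next_minibatch returns a dictionary keyed by the stream information objects:

# create a non-randomized reader and pull a few samples from it
reader = create_reader("trainData_cntk.txt", False, 4, 3)
mb = reader.next_minibatch(5)
print(mb[reader.streams.features].number_of_samples)  # 5, given enough samples in the file
print(mb[reader.streams.labels].number_of_samples)    # 5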

Once we have implemented the data reader, we need a Python function for model creation. For the Iris data set we are going to create a 4-50-3 feed-forward neural network, which consists of one input layer with 4 neurons, one hidden layer with 50 neurons, and an output layer with 3 neurons. The hidden layer uses the tanh activation function.

The function which creates the NN model looks like the following code snippet:

#model creation
# FFNN with one input, one hidden and one output layer 
def create_model(features, hid_dim, out_dim):
    #perform some initialization 
    with cntk.layers.default_options(init = cntk.glorot_uniform()):
        #hidden layer with hid_dim number of neurons and tanh activation function
        h1=cntk.layers.Dense(hid_dim, activation= cntk.ops.tanh, name='hidLayer')(features)
        #output layer with out_dim neurons
        o = cntk.layers.Dense(out_dim, activation = None)(h1)
        return o

As can be seen, the Dense function creates a layer where the user has to specify the layer dimension, the activation function and the input variable. When the hidden layer is created, the input variable is set to the input data. The output layer takes the hidden layer as its input.
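
As a side note, the same 4-50-3 network could also be expressed with the cntk.layers.Sequential composition helper; this sketch is an equivalent alternative, not the code from the post:

# alternative formulation: compose the hidden and output layers into one function
def create_model_sequential(hid_dim, out_dim):
    with cntk.layers.default_options(init=cntk.glorot_uniform()):
        return cntk.layers.Sequential([
            cntk.layers.Dense(hid_dim, activation=cntk.ops.tanh, name='hidLayer'),
            cntk.layers.Dense(out_dim, activation=None)])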

One more helper function shows the progress of the learner. The following function takes three arguments and prints the current status of the trainer.

# Function that prints the training progress
def print_training_progress(trainer, mb, frequency):
    training_loss = "NA"
    eval_error = "NA"

    if mb%frequency == 0:
        training_loss = trainer.previous_minibatch_loss_average
        eval_error = trainer.previous_minibatch_evaluation_average
        print ("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}%".format(mb, training_loss, eval_error*100))   
    return mb, training_loss, eval_error

Once we have implemented all three functions, we can start the CNTK learning process on the Iris data.

At the beginning, we have to specify some helper variables which we will use later.

#setting up the NN type
input_dim=4
hidden_dim = 50
num_output_classes=3
input = cntk.input_variable(input_dim)
label = cntk.input_variable(num_output_classes)

Create the reader for data batching.

# Create the reader for the training data set
reader_train = create_reader("C:/sc/Offline/trainData_cntk.txt", True, input_dim, num_output_classes)

Then create the NN model, with Loss and Error functions:

#Create model and Loss and Error function
z = create_model(input, hidden_dim, num_output_classes)
loss = cntk.cross_entropy_with_softmax(z, label)
label_error = cntk.classification_error(z, label)

Then we define the trainer. The trainer uses the Stochastic Gradient Descent (SGD) learner with a learning rate of 0.2.

# Instantiate the trainer object to drive the model training
learning_rate = 0.2
lr_schedule = cntk.learning_parameter_schedule(learning_rate)
learner = cntk.sgd(z.parameters, lr_schedule)
trainer = cntk.Trainer(z, (loss, label_error), [learner])

Now we need to define the parameters for learning and for showing the results.

# Initialize the parameters for the trainer
minibatch_size = 120 # the minibatch size equals the full training set
num_iterations = 20 #number of iterations 

# Map the data streams to the input and labels.
input_map = {
    label : reader_train.streams.labels,
    input : reader_train.streams.features
}
# Run the trainer and perform model training
training_progress_output_freq = 1

plotdata = {"batchsize":[], "loss":[], "error":[]}

As can be seen, the batch size is set to the size of the training set, which is typical for small data sets. Since each minibatch covers the whole training set, the number of iterations can be very small: the Iris data is very simple and the learner will find a good result very quickly.

Running the trainer looks very simple. In each iteration, the reader loads a batch-size amount of data and passes it to the trainer. The trainer performs the learning process using the SGD learner and returns the Loss and error values for the current iteration. Then we call the print function to show the progress of the trainer.

for i in range(0, int(num_iterations)):
        # Read a mini batch from the training data file
        data=reader_train.next_minibatch(minibatch_size, input_map=input_map) 
        trainer.train_minibatch(data)
        batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq)
        if not (loss == "NA" or error =="NA"):
            plotdata["batchsize"].append(batchsize)
            plotdata["loss"].append(loss)
            plotdata["error"].append(error)

Once the learning process completes, we can present the results.

# Plot the training loss and the training error
import matplotlib.pyplot as plt

plt.figure(1)
plt.subplot(211)
plt.plot(plotdata["batchsize"], plotdata["loss"], 'b--')
plt.xlabel('Minibatch number')
plt.ylabel('Loss')
plt.title('Minibatch run vs. Training loss')

plt.subplot(212)
plt.plot(plotdata["batchsize"], plotdata["error"], 'r--')
plt.xlabel('Minibatch number')
plt.ylabel('Label Prediction Error')
plt.title('Minibatch run vs. Label Prediction Error')
plt.show()

We plot the training loss and the prediction error, which can be converted into the total accuracy of the classifier. The following pictures show those graphs.
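
Since the recorded value is the classification error, converting it into accuracy is a one-liner (a sketch on top of the plotdata dictionary defined above):

# accuracy = 1 - error, using the error values recorded during training
plotdata["accuracy"] = [1.0 - e for e in plotdata["error"]]
print("Final training accuracy: {0:.2f}%".format(plotdata["accuracy"][-1] * 100))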

The last part of the ML procedure is testing, or validating, the model. For the Iris data set we prepared 20 samples which will be used for testing. The code is similar to the previous one, except that we call create_reader with a different file name. Then we evaluate the model, grab the Loss and error values, and print them out.

# Read the test data
reader_test = create_reader("C:/sc/Offline/testData_cntk.txt",False, input_dim, num_output_classes)

test_input_map = {
    label  : reader_test.streams.labels,
    input  : reader_test.streams.features,
}

# Test data for trained model
test_minibatch_size = 20
num_samples = 20
num_minibatches_to_test = num_samples // test_minibatch_size
test_result = 0.0

for i in range(num_minibatches_to_test):
    
    data = reader_test.next_minibatch(test_minibatch_size,input_map = test_input_map)
    eval_error = trainer.test_minibatch(data)
    test_result = test_result + eval_error

# Average of evaluation errors of all test minibatches
print("Average test error: {0:.2f}%".format(test_result*100 / num_minibatches_to_test))

The full sample with the Python code and the data set can be found here.
