In Part 2 of this series, we introduced the steps needed for batch training in TensorFlow with some example Python code. In this next part we’ll have a look at some performance metrics and explore the results.
1. Performance Metrics
Sometimes accuracy is all that matters! (Source: Shutterstock)
It took me a reasonable amount of time to get to this point, due to a combination of factors, but mainly because it was hard to understand how well the models were performing. After I included the following metrics, things began to make more sense.
Accuracy—also called “Classification Accuracy”—is a value from 0 (useless) to 1.0 (perfect) and it’s computed as the ratio of the number of correct predictions to the total number of samples. However, it can be misleading and only works well if there are an approximately equal number of samples in each class; you want accuracy to increase with training.
Loss in TensorFlow is not a metric since you want the loss to decrease with training. Loss functions are not technically metrics either as they are more focused on the internal performance of the model itself, and often relate to the type of model used. For neural network models (which TensorFlow supports) the training and model optimization process is designed to minimize a loss function.
You may notice that the TensorFlow metric links are actually for a thing called “Keras.”
What’s this? Well, Keras is a simpler deep learning library, using Python, which runs on top of TensorFlow. If you need more control, then there are low-level APIs in TensorFlow. (TensorFlow itself is written in a combination of Python, C++, and CUDA—CUDA is a GPU programming language to enable general-purpose computing on GPUs.)
Ok, back to metrics.
A Confusion Matrix can reduce confusion—confused? (Source: Shutterstock)
A confusion matrix is a common way of summarizing model accuracy for binary classification problems (the class being learned is simply TRUE or FALSE). It’s a simple 2×2 matrix containing counts. The rows are the actual classes (TRUE/FALSE), and the columns are the predicted classes. The values in the cells are the counts that satisfy both.
It’s not directly available in TensorFlow, but the individual metrics are available for each cell as follows:
TruePositives—top left cell, the number of samples that are TRUE and correctly predicted TRUE
TrueNegatives—bottom right cell, the number of samples which are FALSE and correctly predicted FALSE
FalseNegatives—top right cell, the number of samples which are TRUE but incorrectly predicted FALSE
FalsePositives—bottom left cell, the number of samples which are FALSE but incorrectly predicted TRUE
These metrics are particularly useful for learning in problem domains where there may be a disproportionate cost for missing/ignoring some positives (FalseNegatives), e.g., fraud detection, or misdiagnosing some negatives (FalsePositives), e.g., in medicine where a cure may have harmful side-effects and should not be used unless a diagnosis is certain.
Accuracy can also be computed as
2. Example Trace
So now let’s have a look at a trace of an example run on 1 week of drone delivery data, which had the same simple shop busy/not busy rule applied for the entire week (i.e., there was no concept drift in the week of data).
class shop_id shop_type shop_location weekday hour avgTime avgDistance avgRating another1 another2 another3 another4 another5
0 0 1 19 10 1 1 44 5 1 2 99 3 17 852
1 1 2 7 8 1 1 27 19 3 2 45 8 18 801
2 0 3 17 15 1 1 47 20 4 1 97 12 10 285
3 1 4 1 12 1 1 24 9 5 2 55 2 11 225
4 1 5 14 10 1 1 48 1 2 1 10 10 15 843
records 5600 columns 14 class 0 2660 class 1 2940
Number of training samples: 3360
Number of testing samples: 2240
Layer (type) Output Shape Param #
dense (Dense) (None, 128) 1792
dropout (Dropout) (None, 128) 0
dense_1 (Dense) (None, 256) 33024
dropout_1 (Dropout) (None, 256) 0
dense_2 (Dense) (None, 128) 32896
dropout_2 (Dropout) (None, 128) 0
dense_3 (Dense) (None, 1) 129
Total params: 67,841
Trainable params: 67,841
Non-trainable params: 0
105/105 [==============================] - 1s 2ms/step - loss: 6.5520 - accuracy: 0.5077 - true_positives: 995.0000 - true_negatives: 711.0000 - false_positives: 849.0000 - false_negatives: 805.0000
105/105 [==============================] - 0s 1ms/step - loss: 0.2763 - accuracy: 0.8813 - true_positives: 1615.0000 - true_negatives: 1346.0000 - false_positives: 214.0000 - false_negatives: 185.0000
70/70 [==============================] - 0s 960us/step - loss: 0.2864 - accuracy: 0.8862 - true_positives: 1064.0000 - true_negatives: 921.0000 - false_positives: 179.0000 - false_negatives: 76.0000
[0.28635311126708984, 0.8861607313156128, 1064.0, 921.0, 179.0, 76.0]
What’s going on here? For a start, there are roughly equal numbers of busy/not busy observations. This turned out to be critical, as if the numbers are too unbalanced then the accuracy metric is telling you nothing useful—i.e., if there’s a 95% chance of a not busy observation then you can just guess and get approximately 0.95 accuracy. I had this problem early on so ended up having to change the rules to get an approximately 50/50 balance. There are workarounds if you are stuck with an unbalanced class distribution.
Next, the accuracy of the model improved from 0.5 to 0.88 after 125 training epochs, a significant improvement, and we didn’t need to use the whole 200 epochs. The confusion metrics tell us that this accuracy is actually pretty good, there are far more true positives and negatives than false ones.
Finally, the evaluation against a different subset of data provides reassurance that the model hasn’t been overfitted. The accuracy is 0.88, and again the true positives and negatives outweigh the false ones. Here’s the evaluation confusion matrix:
3. Measuring Training Time and Changing Batch Size
It didn’t seem very “scientific” just showing the results of a single ML trace (and in reality, I had many runs to get to this point). I was also interested in knowing how long training takes (even with such as small data sample), how repeatable the training is even with the same settings, and if batch size makes much difference to training time or accuracy etc.
So, I added the following code to accurately measure the process specific time for the fit() function:
start_t_ns = time.process_time_ns()
model.fit(drone_features, drone_labels, batch_size=BATCH_SIZE, epochs=EPOCHS, callbacks=[callback])
end_t_ns = time.process_time_ns()
time_diff = end_t_ns - start_t_ns
print("training time (ns) = ", time_diff)
Next, I did 3 training runs with increasing batch size (from 8 to 64) and plotted the results. The 1st graph is for accuracy. The most obvious observation is that in some training runs the accuracy is not very good, and overall, there’s a wide spread from 0.5 to 0.9. The results for a batch size of 32 are more consistent, around 0.85 to 0.9. (but this could be luck!).
The 2nd graph shows the training time in seconds. Again, there’s a wide spread from 5 to 92 seconds.
What’s actually going on here? Basically, for some runs, the training stops very quickly as the optimization gets stuck in a bad local minima—corresponding to results with low accuracy. Ideally you want to find the global minimum which corresponds to the optimal solution, but optimization may find a local minima which isn’t optimal and is difficult to move away from—there’s a nice 3D diagram showing this here. This is a bit like trying to find your way to the centre of a maze but getting stuck in dead ends and having to backtrack!
Try not to get stuck in a dead-end local minima (Source: Shutterstock)
Here’s a good blog that explains more about the effect of batch size on training dynamics. Another obvious conclusion to draw from all of these results is that each training run is not repeatable. This is because we are doing random sampling (shuffling) of the training set from the available data, which likely contributes to the variation in results for the larger batch sizes (from the above blog):
“Large batch size means the model makes very large gradient updates and very small gradient updates. The size of the update depends heavily on which particular samples are drawn from the dataset. On the other hand, using small batch size means the model makes updates that are all about the same size. The size of the update only weakly depends on which particular samples are drawn from the dataset.”
So, this seems like a good start to our ML experiments. What have we learned so far? Even though this isn’t a lot of training data (I had to simplify my rules several times to ensure sufficient accuracy with around 6000 observations), a large number of auto epochs, smaller batch size, and balanced class distributions can result in good results—up to 90% accuracy. But remember to look at the confusion metrics for a sanity check, and maybe do multiple training runs (saving the parameters and selecting/loading the best result) if you are sampling the training data randomly. It can take up to a minute to train a model even with this small quantity of data, so we’ll need to take training time into account as move onto potentially more demanding real-time training scenarios.
In the next part of this series, we’ll start to find out how to introduce and cope with concept drift in the data streams, try incremental TensorFlow training, and explore how to use Kafka for Machine Learning.