Machine Learning Over Streaming Kafka® Data

In Part 2 of this series, we introduced the steps needed for batch training in TensorFlow with some example Python code. In this next part we’ll have a look at some performance metrics and explore the results.

1. Performance Metrics

Sometimes accuracy is all that matters! (Source: Shutterstock)

It took me a reasonable amount of time to get to this point, due to a combination of factors, but mainly because it was hard to understand how well the models were performing. After I included the following metrics, things began to make more sense.

Here’s the list of metrics that can be included in training and evaluation results, and here some of the more useful metrics. some of the more useful metrics.

Accuracy—also called “Classification Accuracy”—is a value from 0 (useless) to 1.0 (perfect) and it’s computed as the ratio of the number of correct predictions to the total number of samples. However, it can be misleading and only works well if there are an approximately equal number of samples in each class; you want accuracy to increase with training.

Biodiversity loss includes species extinction—the Tasmanian Tiger (Thylacine) became extinct in 1936 (Source: https://commons.wikimedia.org/wiki/File:Thylacine.png)

Loss in TensorFlow is not a metric since you want the loss to decrease with training. Loss functions are not technically metrics either as they are more focused on the internal performance of the model itself, and often relate to the type of model used. For neural network models (which TensorFlow supports) the training and model optimization process is designed to minimize a loss function.

You may notice that the TensorFlow metric links are actually for a thing called “Keras.”

What’s this? Well, Keras is a simpler deep learning library, using Python, which runs on top of TensorFlow. If you need more control, then there are low-level APIs in TensorFlow. (TensorFlow itself is written in a combination of Python, C++, and CUDA—CUDA is a GPU programming language to enable general-purpose computing on GPUs.)

Ok, back to metrics.

A Confusion Matrix can reduce confusion—confused? (Source: Shutterstock)

Confusion Matrix

A confusion matrix is a common way of summarizing model accuracy for binary classification problems (the class being learned is simply TRUE or FALSE). It’s a simple 2×2 matrix containing counts. The rows are the actual classes (TRUE/FALSE), and the columns are the predicted classes. The values in the cells are the counts that satisfy both.

It’s not directly available in TensorFlow, but the individual metrics are available for each cell as follows:

TruePositives—top left cell, the number of samples that are TRUE and correctly predicted TRUE

TrueNegatives—bottom right cell, the number of samples which are FALSE and correctly predicted FALSE

FalseNegatives—top right cell, the number of samples which are TRUE but incorrectly predicted FALSE

FalsePositives—bottom left cell, the number of samples which are FALSE but incorrectly predicted TRUE

These metrics are particularly useful for learning in problem domains where there may be a disproportionate cost for missing/ignoring some positives (FalseNegatives), e.g., fraud detection, or misdiagnosing some negatives (FalsePositives), e.g., in medicine where a cure may have harmful side-effects and should not be used unless a diagnosis is certain.

Accuracy can also be computed as

Some previous blogs that explained accuracy and confusion (and more, including precision and recall etc.) in the context of Spark ML are here and here.

2. Example Trace

So now let’s have a look at a trace of an example run on 1 week of drone delivery data, which had the same simple shop busy/not busy rule applied for the entire week (i.e., there was no concept drift in the week of data).

python3 dronebatch.py 

   class  shop_id  shop_type  shop_location  weekday  hour  avgTime  avgDistance  avgRating  another1  another2  another3  another4  another5 

0      0        1         19             10        1     1       44            5          1         2        99         3        17       852 

1      1        2          7              8        1     1       27           19          3         2        45         8        18       801 

2      0        3         17             15        1     1       47           20          4         1        97        12        10       285 

3      1        4          1             12        1     1       24            9          5         2        55         2        11       225 

4      1        5         14             10        1     1       48            1          2         1        10        10        15       843 

dtype: object 

records  5600  columns  14  class 0  2660  class 1  2940 

Number of training samples:  3360 

Number of testing samples:  2240 

_________________________________________________________________ 

 Layer (type)                Output Shape              Param # 

================================================================= 

 dense (Dense)               (None, 128)               1792 

 

 dropout (Dropout)           (None, 128)               0 

 

 dense_1 (Dense)             (None, 256)               33024 

 

 dropout_1 (Dropout)         (None, 256)               0 

 

 dense_2 (Dense)             (None, 128)               32896 

 

 dropout_2 (Dropout)         (None, 128)               0 

 

 dense_3 (Dense)             (None, 1)                 129 

 

================================================================= 

Total params: 67,841 

Trainable params: 67,841 

Non-trainable params: 0 

_________________________________________________________________ 

None 

Epoch 1/200 

105/105 [==============================] - 1s 2ms/step - loss: 6.5520 - accuracy: 0.5077 - true_positives: 995.0000 - true_negatives: 711.0000 - false_positives: 849.0000 - false_negatives: 805.0000 

… 

Epoch 125/200 

105/105 [==============================] - 0s 1ms/step - loss: 0.2763 - accuracy: 0.8813 - true_positives: 1615.0000 - true_negatives: 1346.0000 - false_positives: 214.0000 - false_negatives: 185.0000 

testing... 

70/70 [==============================] - 0s 960us/step - loss: 0.2864 - accuracy: 0.8862 - true_positives: 1064.0000 - true_negatives: 921.0000 - false_positives: 179.0000 - false_negatives: 76.0000 

[0.28635311126708984, 0.8861607313156128, 1064.0, 921.0, 179.0, 76.0]

python3 dronebatch.py

class shop_id shop_type shop_location weekday hour avgTime avgDistance avgRating another1 another2 another3 another4 another5

0 0 1 19 10 1 1 44 5 1 2 99 3 17 852

1 1 2 7 8 1 1 27 19 3 2 45 8 18 801

2 0 3 17 15 1 1 47 20 4 1 97 12 10 285

3 1 4 1 12 1 1 24 9 5 2 55 2 11 225

4 1 5 14 10 1 1 48 1 2 1 10 10 15 843

dtype: object

records 5600 columns 14 class 0 2660 class 1 2940

Number of training samples: 3360

Number of testing samples: 2240

_________________________________________________________________

Layer (type) Output Shape Param #

=================================================================

dense (Dense) (None, 128) 1792

dropout (Dropout) (None, 128) 0

dense_1 (Dense) (None, 256) 33024

dropout_1 (Dropout) (None, 256) 0

dense_2 (Dense) (None, 128) 32896

dropout_2 (Dropout) (None, 128) 0

dense_3 (Dense) (None, 1) 129

=================================================================

Total params: 67,841

Trainable params: 67,841

Non-trainable params: 0

_________________________________________________________________

None

Epoch 1/200

105/105 [==============================] - 1s 2ms/step - loss: 6.5520 - accuracy: 0.5077 - true_positives: 995.0000 - true_negatives: 711.0000 - false_positives: 849.0000 - false_negatives: 805.0000

…

Epoch 125/200

105/105 [==============================] - 0s 1ms/step - loss: 0.2763 - accuracy: 0.8813 - true_positives: 1615.0000 - true_negatives: 1346.0000 - false_positives: 214.0000 - false_negatives: 185.0000

testing...

70/70 [==============================] - 0s 960us/step - loss: 0.2864 - accuracy: 0.8862 - true_positives: 1064.0000 - true_negatives: 921.0000 - false_positives: 179.0000 - false_negatives: 76.0000

[0.28635311126708984, 0.8861607313156128, 1064.0, 921.0, 179.0, 76.0]

What’s going on here? For a start, there are roughly equal numbers of busy/not busy observations. This turned out to be critical, as if the numbers are too unbalanced then the accuracy metric is telling you nothing useful—i.e., if there’s a 95% chance of a not busy observation then you can just guess and get approximately 0.95 accuracy. I had this problem early on so ended up having to change the rules to get an approximately 50/50 balance. There are workarounds if you are stuck with an unbalanced class distribution.

Next, the accuracy of the model improved from 0.5 to 0.88 after 125 training epochs, a significant improvement, and we didn’t need to use the whole 200 epochs. The confusion metrics tell us that this accuracy is actually pretty good, there are far more true positives and negatives than false ones.

Finally, the evaluation against a different subset of data provides reassurance that the model hasn’t been overfitted. The accuracy is 0.88, and again the true positives and negatives outweigh the false ones. Here’s the evaluation confusion matrix:

3. Measuring Training Time and Changing Batch Size

(Source: Shutterstock)

It didn’t seem very “scientific” just showing the results of a single ML trace (and in reality, I had many runs to get to this point). I was also interested in knowing how long training takes (even with such as small data sample), how repeatable the training is even with the same settings, and if batch size makes much difference to training time or accuracy etc.

So, I added the following code to accurately measure the process specific time for the fit() function:

start_t_ns = time.process_time_ns()
model.fit(drone_features, drone_labels, batch_size=BATCH_SIZE, epochs=EPOCHS, callbacks=[callback])
end_t_ns = time.process_time_ns()
time_diff = end_t_ns - start_t_ns
print("training time (ns) = ", time_diff)

start_t_ns = time.process_time_ns()

model.fit(drone_features, drone_labels, batch_size=BATCH_SIZE, epochs=EPOCHS, callbacks=[callback])

end_t_ns = time.process_time_ns()

time_diff = end_t_ns - start_t_ns

print("training time (ns) = ", time_diff)

This is the most accurate approach to time measurement in Python according to this blog.

Next, I did 3 training runs with increasing batch size (from 8 to 64) and plotted the results. The 1^st graph is for accuracy. The most obvious observation is that in some training runs the accuracy is not very good, and overall, there’s a wide spread from 0.5 to 0.9. The results for a batch size of 32 are more consistent, around 0.85 to 0.9. (but this could be luck!).

The 2^nd graph shows the training time in seconds. Again, there’s a wide spread from 5 to 92 seconds.

What’s actually going on here? Basically, for some runs, the training stops very quickly as the optimization gets stuck in a bad local minima—corresponding to results with low accuracy. Ideally you want to find the global minimum which corresponds to the optimal solution, but optimization may find a local minima which isn’t optimal and is difficult to move away from—there’s a nice 3D diagram showing this here. This is a bit like trying to find your way to the centre of a maze but getting stuck in dead ends and having to backtrack!

Try not to get stuck in a dead-end local minima (Source: Shutterstock)

Here’s a good blog that explains more about the effect of batch size on training dynamics. Another obvious conclusion to draw from all of these results is that each training run is not repeatable. This is because we are doing random sampling (shuffling) of the training set from the available data, which likely contributes to the variation in results for the larger batch sizes (from the above blog):

“Large batch size means the model makes very large gradient updates and very small gradient updates. The size of the update depends heavily on which particular samples are drawn from the dataset. On the other hand, using small batch size means the model makes updates that are all about the same size. The size of the update only weakly depends on which particular samples are drawn from the dataset.”

So, this seems like a good start to our ML experiments. What have we learned so far? Even though this isn’t a lot of training data (I had to simplify my rules several times to ensure sufficient accuracy with around 6000 observations), a large number of auto epochs, smaller batch size, and balanced class distributions can result in good results—up to 90% accuracy. But remember to look at the confusion metrics for a sanity check, and maybe do multiple training runs (saving the parameters and selecting/loading the best result) if you are sampling the training data randomly. It can take up to a minute to train a model even with this small quantity of data, so we’ll need to take training time into account as move onto potentially more demanding real-time training scenarios.

In Part 4 of this series, we’ll start to find out how to introduce and cope with concept drift in the data streams, try incremental TensorFlow training, and explore how to use Kafka for Machine Learning.