Cracking the Enigma — Convolutional Neural Network Hyperparameters!

Shashank R
6 min read · Apr 18, 2021

As depicted in The Imitation Game, Alan Turing and fellow mathematicians cracked the German Enigma code — a key milestone that let the Allies decrypt German intelligence messages far more efficiently than humans could alone, and subsequently helped counter the prevailing Nazi plans. Fast-forward to ~2005, when challenges were put out to improve face recognition, which has since exploded in quality, moved from the purview of federal agencies like DARPA and NIST to commercial entities (and social media, with all the Facebook tagging), and has helped counter a current evil — terrorism — amongst other benefits. Convolutional Neural Networks have played a key part in efficiently solving this Enigma of our times.

Convolutional Neural Networks (CNNs) represent a class of deep neural networks heavily used for analyzing and classifying images, computer vision, and medical analyses, as well as for analyzing videos and for NLP tasks such as sentence modeling. Thanks to the processes of convolution and pooling, a CNN significantly reduces the number of features that have to be learnt, making it more efficient and practical than a traditional fully connected network when it comes to analyzing images and videos.

CNNs train faster than fully connected networks, given the significant reduction in parameters that need to be trained; one watch-out, however, is that classifying new images (as opposed to training/testing) may take longer.

Architecture: A typical CNN architecture entails a few Convolution (with ReLU activation) and Pooling layers, followed by a layer that flattens the tensor. This then typically feeds into traditional feed-forward fully connected layers (with ReLU activation), and finally into the output layer with the appropriate activation (softmax for a multi-class model, sigmoid for a binary classification model).

Architecture of typical CNN for multi-classification; Source: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
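The shape arithmetic behind such a stack can be sketched in plain Python. The layer sizes below are illustrative (a 250×250 input with 3×3 convolutions and 2×2 pooling), not the exact model from this post:

```python
def conv_out(n, f, stride=1, pad=0):
    """Output width/height of a convolution: (n - f + 2p) / s + 1."""
    return (n - f + 2 * pad) // stride + 1

def pool_out(n, p, stride):
    """Output width/height of a pooling layer."""
    return (n - p) // stride + 1

# A 250x250 image through Conv(3x3) -> MaxPool(2x2, stride 2), twice:
n = 250
n = conv_out(n, 3)      # 248
n = pool_out(n, 2, 2)   # 124
n = conv_out(n, 3)      # 122
n = pool_out(n, 2, 2)   # 61
flattened = n * n * 32  # flatten, assuming 32 filters in the last conv layer
print(n, flattened)     # 61 119072
```

Each conv/pool pair shrinks the spatial dimensions, which is exactly why the flattened vector feeding the dense layers stays manageable.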

Optimizing hyperparameters: Given the multiple layers in the network, as well as the options within the convolution and fully connected layers, there are a lot of hyperparameters that can be tweaked while searching for the optimal model.

How is an image interpreted: Effectively, an image file represents pixels. A 250×250 colored image has a 3rd dimension (or channel) to represent the 3 additive primary colors — Red, Green, and Blue — each ranging from 0 to 255, which together define the colored image. For example, pure red is [255, 0, 0], black is [0, 0, 0], and white is [255, 255, 255]. Black & white images, on the other hand, have only 1 channel and can be represented by 2 dimensions (height and width) alone.
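This is easy to see with numpy — a small sketch of the array shapes involved:

```python
import numpy as np

# A 250x250 color image: height x width x 3 channels (R, G, B), values 0-255.
img = np.zeros((250, 250, 3), dtype=np.uint8)

img[0, 0] = [255, 0, 0]      # top-left pixel: pure red
img[1, 1] = [255, 255, 255]  # white
# img[2, 2] stays [0, 0, 0]  # black (the default)

print(img.shape)             # (250, 250, 3)

# A black & white image of the same size needs only 2 dimensions:
gray = np.zeros((250, 250), dtype=np.uint8)
print(gray.shape)            # (250, 250)
```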

Example used here: There is a fantastic dataset curated by Yashvardhan Jain (Ref #1 in the citations below) which comprises pictures of hot-dogs and “not-hot-dogs” (ex: pictures of cats, dogs, random objects). The classification problem here is to train a model on these hot-dog and not-hot-dog images, and then predict whether an unseen image fed into the model is a hot-dog or not.

Is definitely a hot-dog!
Is certainly not a hot-dog!

Optimizing hyperparameters: The optimal hyperparameters can be quite different depending on the data itself, model architecture, and multiple other drivers. The observations below are specific to the model that was developed here.

A grid search was performed using various hyperparameters to evaluate how the accuracy of the model changed as hyperparameters were varied. See below for sample parameters used to evaluate accuracy for different CNN and Dense layers, filter-sizes (or kernel sizes) as well as the number of nodes in the fully connected layers.

from sklearn.model_selection import GridSearchCV

# `nn` is the CNN wrapped so scikit-learn can treat it as an estimator
# (e.g. a KerasClassifier built from a model-building function that
# accepts these hyperparameters).
params = {
    'epochs': [10, 20],
    'n_cnn_layers': [1, 2],
    'n_dense_layers': [2, 3],
    'n_filtersize': [3, 5],
    'num_nodes': [32]
}
gs_deep = GridSearchCV(estimator=nn, param_grid=params, cv=2, verbose=12)
gs_deep.fit(X_train, y_train)

Convolution Layer Hyperparameters:

  • Filter size: The bigger the filter, the longer the model needs to train, so a 3×3 filter is normally preferred over, say, a 5×5 filter. Furthermore, odd-sized filters are preferred over even-sized ones, and square filters are generally preferred (ref: #4). Smaller filters combined with a greater number of filters seems to be the preferred approach. On the hot-dog dataset, the 3×3 filter reached ~67% accuracy while the 5×5 filter reached ~55%.
  • Number of filters: This is a key hyperparameter, and more filters improve performance since they can capture more details from the images. The general recommendation is also to use more filters for deeper convolution layers — for example, 16 in the first CNN layer, followed by 32 in the second, and so on. The primary reason is that the first CNN layers capture basic features like edges and curves, while subsequent layers capture more complex patterns. Using more filters, however, can slow down learning and inflate the model size — which may create challenges around saving the model for production use.
CV1 | CV2 | Train Acc | Val Acc
 32 |  32 |    0.7314 |  0.6696
 32 |  64 |    0.7396 |  0.7306
  • Filter strides: Strides control how the filter (and the pooling grid) moves across the image. Higher strides result in faster processing but run the risk of missing some data patterns.
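The interplay of filter size and stride is easy to demonstrate with a naive numpy convolution (a simplified "valid" convolution over a square grayscale image, written only to show how the output shape changes):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Naive 'valid' convolution: slide the filter over the image."""
    f = kernel.shape[0]
    out_size = (image.shape[0] - f) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.random.rand(28, 28)
print(convolve2d(img, np.ones((3, 3))).shape)            # (26, 26)
print(convolve2d(img, np.ones((5, 5))).shape)            # (24, 24) - bigger filter, more work per step
print(convolve2d(img, np.ones((3, 3)), stride=2).shape)  # (13, 13) - higher stride, fewer positions
```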

Pooling hyperparameters:

  • Pool Size: Increasing the pool size results in fewer parameters in the network, since the image is mapped smaller; however, it means throwing away some information. If time is not a constraint, a pool size of (2,2) is recommended, while a higher pool size (ex: (3,3)) is recommended for faster performance.
  • Pool stride: The most common pool stride is a stride of 2. Higher stride results in faster learning, albeit, there is a small chance that accuracy might decrease. For this example, accuracy was slightly lower — but the average epoch time was much quicker, resulting in faster training.
  • Pooling method: MaxPooling has been shown to outperform AveragePooling and is normally the preferred method today.
CV1 | CV2 | Pool-Stride | Train-Acc | Val-Acc | Avg-Epoch-time
 32 |  64 |           2 |    0.7396 |  0.7306 | 135 sec
 32 |  64 |           3 |    0.7632 |  0.7047 |  79 sec
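A minimal numpy sketch of max pooling makes the pool-size/stride trade-off concrete (illustrative only, on a 6×6 feature map):

```python
import numpy as np

def max_pool(image, pool=2, stride=2):
    """Max pooling: keep the largest value in each pool window."""
    out_size = (image.shape[0] - pool) // stride + 1
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = image[i*stride:i*stride+pool,
                              j*stride:j*stride+pool].max()
    return out

fmap = np.arange(36.).reshape(6, 6)
print(max_pool(fmap, pool=2, stride=2).shape)  # (3, 3)
print(max_pool(fmap, pool=2, stride=3).shape)  # (2, 2) - higher stride, smaller map, faster layers downstream
print(max_pool(fmap, pool=2, stride=2)[0, 0])  # 7.0 - max of the top-left 2x2 window {0, 1, 6, 7}
```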

Fully connected layers:

  • # of neurons: This can be optimized using standard feed-forward network training methods — more neurons can yield better predictions, but the network needs more time to train.

The model was also tested with different kernel regularization rates and a dropout rate of 50% on the convolution and fully connected layers — to drive regularization and reduce overfitting.
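For intuition, 50% dropout can be sketched as the standard "inverted dropout" trick — a simplified stand-in for what a framework's Dropout layer does during training, not the code used in this model:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(activations, rate=0.5):
    """Inverted dropout: zero out a fraction of activations during
    training and scale the survivors so the expected sum is unchanged."""
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

a = np.ones(10)
dropped = dropout(a, rate=0.5)
print(dropped)  # roughly half the entries are 0, the rest are 2.0
```

At inference time dropout is switched off, which the 1/(1 - rate) scaling during training accounts for.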

Cracking image related problems has been the Imitation Game equivalent of our times in hunting for terrorists, and kudos to all the Alan Turings who have enabled this.

Link to the model trained on the dataset: One key limitation was that I had to cap the filter count at 32 to keep the model below 100MB, which is GitHub's repo limitation. This capped the accuracy at 65%, while the best model reached around 72%. The model had a precision around 70% and recall around 80% — implying more false positives than false negatives, i.e. it mislabels not-hot-dogs as hot-dogs more often than the reverse.

Link to hotdog app:

Citations:

