Neural Networks: Tricks of the Trade

Joaquín Padilla Montani

Deep Learning with TensorFlow WS18/19 @ IST Austria

January 7th, 2019

Difficulties in Training DNN

Current DNN architectures can be very deep
(e.g., for object detection in high-resolution images).

This presents several challenges:

  1. Lower layers are hard to train
  2. Speed (lots of parameters; massive data sets)
  3. High risk of overfitting

1. Vanishing/Exploding Gradients Problem

Often the gradient w.r.t. weights in lower layers is very small/vanishes.

This can slow/stop training.

The opposite can also happen (e.g., RNNs), where gradients explode.

What to do about it:

  • Better (nonsaturating) activation functions
  • Smarter initializations
  • Batch normalization
  • Gradient clipping

(Nonsaturating) Activation Functions

In [7]:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "activations.png", width = 800)
Out[7]:

Problems:

  • Tanh, logistic (sigmoid): saturation
  • ReLU: "dying ReLUs" problem

Nonsaturating Activation Functions

In [8]:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "lrelu-elu.png", width = 800)
Out[8]:
  • $\text{LeakyReLU}_\alpha(x) = \max(\alpha x, x)$
  • $\text{ELU}_\alpha(x) = \alpha (\exp(x) - 1)$ if $x \leq 0$; the identity otherwise
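
Both formulas translate directly into NumPy (a minimal sketch for intuition; the $\alpha$ defaults below are common choices, not values prescribed by the papers):

In [ ]:
import numpy as np

def leaky_relu(x, alpha=0.01):
    # max(alpha*x, x) elementwise: a small negative slope instead of a flat zero
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    # alpha*(exp(x) - 1) for x <= 0; the identity otherwise
    return np.where(x <= 0, alpha * (np.exp(x) - 1.0), x)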
In [9]:
#ICLR 2016
Image(filename = "elupaper.png", width = 600)
Out[9]:

In TensorFlow

In [ ]:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.leaky_relu)
In [ ]:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu)
In [10]:
#NIPS 2017 (tf.nn.selu)
Image(filename = "selu.png", width = 600)
Out[10]:
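
Usage is analogous to the other activations (a sketch; the SELU paper couples self-normalization to LeCun-normal initialization, which matches the defaults of tf.variance_scaling_initializer):

In [ ]:
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.selu,
                          kernel_initializer=tf.variance_scaling_initializer())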

Xavier/Glorot and He Initialization

In [11]:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "initializations.png", width = 600)
Out[11]:
In [ ]:
# Note: the core tf.variance_scaling_initializer expects lowercase mode
# strings; "fan_avg" averages fan-in and fan-out (Glorot-style scaling)
he_init = tf.variance_scaling_initializer(mode="fan_avg")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")

Batch Normalization

Problem: the distribution of a layer's input changes as the parameters of previous layers change.

Idea: before activation, zero-center and normalize the inputs (over the current mini-batch), then let the network learn two new parameters per layer for scaling and shifting.

The model learns, in each layer, the best scale and mean for its inputs.

Batch Normalization

$$\mu_B = \frac{1}{m_B} \sum_{i=1}^{m_B} \textbf{x}^{(i)}$$

$$\sigma_B^2 = \frac{1}{m_B} \sum_{i=1}^{m_B} (\textbf{x}^{(i)} - \mu_B)^2$$

$$\hat{\textbf{x}}^{(i)} = \frac{\textbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$\textbf{z}^{(i)} = \gamma \, \hat{\textbf{x}}^{(i)} + \beta$$

During training, the layers learn $\gamma$ (the scale) and $\beta$ (the offset).

Each layer also maintains running estimates of the overall $\mu$ and $\sigma$, to use at test time
(instead of the batch statistics).
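
In tf.layers.batch_normalization these test-time statistics are exponential moving averages of the batch statistics (a conceptual sketch of the update; momentum defaults to 0.99):

In [ ]:
def update_running(running, batch_stat, momentum=0.99):
    # Exponential moving average of the batch statistics (conceptual sketch;
    # momentum=0.99 is the default of tf.layers.batch_normalization)
    return momentum * running + (1 - momentum) * batch_stat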

In TensorFlow (Construction Phase)

In [ ]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")

training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training)

In TensorFlow (Execution Phase)

In [ ]:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

Gradient Clipping

In [ ]:
threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

2. Faster Optimizers

Momentum Optimization

Regular Gradient Descent: walking down a hill

$\theta \leftarrow \theta - \eta \nabla_\theta J(\theta)$

Momentum Optimization: rolling down a hill

$\textbf{m} \leftarrow \beta \, \textbf{m} + \eta \nabla_\theta J(\theta)$
$\theta \leftarrow \theta - \textbf{m}$

The gradient is thus used as an acceleration, instead of as a speed.

The new hyperparameter $\beta$, called the *momentum*, controls the "friction".
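
In plain NumPy, one momentum step is two lines (an illustrative sketch; grad stands for the current gradient $\nabla_\theta J(\theta)$):

In [ ]:
import numpy as np

def momentum_step(theta, m, grad, beta=0.9, eta=0.01):
    # m <- beta*m + eta*grad (the "velocity"); theta <- theta - m
    m = beta * m + eta * grad
    return theta - m, m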

In [2]:
# Image from:
# www.towardsdatascience.com/
# how-to-train-neural-network-faster-with-optimizers-d297730b3713
Image(filename = "momentum.png", width = 550)
Out[2]:

Momentum Optimization

Advantages vs. GD:

  • Escape plateaus faster
  • Could help roll past local optima

Interactive applet for gaining intuition: https://distill.pub/2017/momentum/

Implementation in TensorFlow

In [ ]:
with tf.name_scope("train"):
    optimizer   = tf.train.MomentumOptimizer(learning_rate=lr, momentum=0.9)
    training_op = optimizer.minimize(loss)
    
#Even better: Nesterov Accelerated Gradient
tf.train.MomentumOptimizer(learning_rate=lr, momentum=0.9, use_nesterov=True)

Adam Optimization

Initialize $\textbf{m}$ and $\textbf{s}$ to $0$.

  1. $\textbf{m} \leftarrow \beta_1 \, \textbf{m} + (1- \beta_1) \nabla_\theta J(\theta)$
  2. $\textbf{s} \leftarrow \beta_2 \, \textbf{s} + (1- \beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$
  3. $\hat{\textbf{m}} \leftarrow \frac{\textbf{m}}{1-\beta_1^T}$
  4. $\hat{\textbf{s}} \leftarrow \frac{\textbf{s}}{1-\beta_2^T}$
  5. $\theta \leftarrow \theta - \eta \, \hat{\textbf{m}} \oslash \sqrt{\hat{\textbf{s}} + \epsilon}$

Where $T$ is the iteration number; steps 3 and 4 correct the bias of $\textbf{m}$ and $\textbf{s}$ toward $0$ at the start of training (see the NumPy sketch below).

Typical hyperparameter values (default in TensorFlow):

  • Momentum decay: $\beta_1 = 0.9$
  • Scaling decay: $\beta_2=0.999$
  • Smoothing term: $\epsilon = 10^{-8}$
  • Learning rate: $\eta = 0.001$
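
The five steps map one-to-one onto NumPy (a minimal sketch; the bias-corrected copies m_hat and s_hat leave the running averages m and s untouched, and t is the 1-based iteration count):

In [ ]:
import numpy as np

def adam_step(theta, m, s, grad, t,
              eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad                  # 1. momentum average
    s = beta2 * s + (1 - beta2) * grad * grad           # 2. scaling average
    m_hat = m / (1 - beta1 ** t)                        # 3. bias correction
    s_hat = s / (1 - beta2 ** t)                        # 4. bias correction
    theta = theta - eta * m_hat / np.sqrt(s_hat + eps)  # 5. update
    return theta, m, s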

Implementation in TensorFlow

In [ ]:
with tf.name_scope("train"):
    optimizer   = tf.train.AdamOptimizer(learning_rate=learning_rate)
    training_op = optimizer.minimize(loss)
In [12]:
# NIPS 2017
Image(filename = "adaptivepaper.png", width = 600)
Out[12]:

Learning Rate Scheduling

In [13]:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "learningrate.png", width = 600)
Out[13]:

Learning Rate Scheduling

Idea: start with a high learning rate, then reduce it.

Possible implementations:

  • Predetermined piecewise constant
  • Performance scheduling
    • measure the val-error every $N$ steps;
    • reduce LR by a factor of $\lambda$ when the error is stuck (see the sketch after this list)
  • Exponential scheduling
    • $\eta(t) = \eta_0 10^{-t/r}$
  • Power scheduling
    • $\eta(t) = \eta_0 (1+t/r)^{-c}$, hyperparam. $c$ is typically $1$
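
Only exponential scheduling gets a TF helper below; performance scheduling is simple to hand-roll (an illustrative sketch, not a library API):

In [ ]:
def performance_schedule(lr, err_history, lam=2.0, patience=5):
    # Divide the LR by lam when the last `patience` val-errors contain
    # no improvement over the best earlier val-error.
    if len(err_history) > patience and \
       min(err_history[-patience:]) >= min(err_history[:-patience]):
        return lr / lam
    return lr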

Implementation in TensorFlow (Exponential Scheduling)

In [ ]:
with tf.name_scope("train"):
    #\eta(t) = \eta_0 10^{-t/r}
    initial_learning_rate = 0.1 # eta_0
    decay_steps = 10000 # r
    decay_rate = 1/10 # gives eta(t) = eta_0 * 10^(-t/r)
    global_step = tf.Variable(0, trainable=False, name="global_step") # t
    learning_rate = tf.train.exponential_decay(initial_learning_rate,
                                               global_step,
                                               decay_steps,
                                               decay_rate)
    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
    training_op = optimizer.minimize(loss, global_step=global_step)
In [ ]:
#In execution, each batch will increase global_step by 1
for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

3. Avoiding Overfitting

Early Stopping

Typical situation when training large DNNs:

  • train-error decreases steadily,
  • but val-error eventually starts to rise again (U-shaped curve)

Idea: stop training when val-performance starts dropping.

In [14]:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "earlystopping.png", width = 600)
Out[14]:

Early Stopping: Possible Implementation

  • After each epoch, evaluate the model on val-set
  • If the model improved, save the current parameters
  • End training when the model hasn't improved (vs. the best model so far) for a certain number of epochs
  • After training, return saved parameters

In TensorFlow

In [ ]:
n_epochs, batch_size = 1000, 20

max_checks_without_progress = 20
checks_without_progress = 0
best_loss = np.infty

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        rnd_idx = np.random.permutation(len(X_train))
        for rnd_indices in np.array_split(rnd_idx, len(X_train) // batch_size):
            X_batch, y_batch = X_train[rnd_indices], y_train[rnd_indices]
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        
        loss_val, acc_val = sess.run([loss, acc], feed_dict={X:X_val, y:y_val})
        if loss_val < best_loss:
            save_path = saver.save(sess, "./my_model.ckpt")
            best_loss = loss_val
            checks_without_progress = 0
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break

Dropout

Idea: at every training step, each neuron (except output units) has a
probability $p$ of being temporarily ignored (of being "dropped out").

Goals:

  • Units can't rely too much on one specific input connection.
  • Units can't rely too much on neighbors (prevent too much "co-adaptation").

Dropout

In [18]:
# Image from:
# https://warwick.ac.uk/fac/cross_fac/complexity/
# people/students/dtc/students2013/eyre/statsreadinggroup/
Image(filename = "dropout.png", width = 550)
Out[18]:

Dropout

Units are only dropped while training.

Problem: when testing, a given neuron will have, on average, twice as many inputs as it had during training (assuming $p = 0.5$).

Solution: divide each neuron's output by the "keep probability" $(1-p)$ during training.
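
This "inverted dropout" scaling fits in a few lines of NumPy (a sketch of the idea; tf.layers.dropout, used below, does the equivalent internally):

In [ ]:
import numpy as np

def inverted_dropout(a, p=0.5, training=True):
    # Drop each unit with probability p; scale survivors by 1/(1-p)
    # so the expected activation is unchanged between train and test.
    if not training:
        return a
    mask = np.random.rand(*a.shape) >= p
    return a * mask / (1.0 - p)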

Dropout

In [15]:
# Image from: Ian Goodfellow et al. "Deep Learning." MIT Press (2016).
# http://www.deeplearningbook.org/
Image(filename = "dropoutemsemble.png", width = 500)
Out[15]:

Implementation in TensorFlow (Construction Phase)

In [ ]:
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int32, shape=(None), name="y")
training = tf.placeholder_with_default(False, shape=(), name='training')

p = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, rate=p, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu)
    hidden1_drop = tf.layers.dropout(hidden1, rate=p, training=training)
    
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu)
    hidden2_drop = tf.layers.dropout(hidden2, rate=p, training=training)
    
    logits = tf.layers.dense(hidden2_drop, n_outputs)
    
'''
Dropout consists in randomly setting a fraction rate of input units to 0 at
each update during training time, which helps prevent overfitting.
The units that are kept are scaled by 1 / (1 - rate), so that their sum
is unchanged at training time and inference time.

https://www.tensorflow.org/api_docs/python/tf/layers/dropout
'''

Implementation in TensorFlow (Execution Phase)

In [ ]:
n_epochs = 20
batch_size = 50

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for X_batch, y_batch in shuffle_batch(X_train, y_train, batch_size):
            sess.run(trn_op, feed_dict={X:X_batch, y:y_batch, training:True})
        accuracy_val = accuracy.eval(feed_dict={X: X_valid, y: y_valid})
        print(epoch, "Validation accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

References

  • Aurélien Géron. "Hands-On Machine Learning with Scikit-Learn & TensorFlow." O'Reilly Media (2017).
    [Main reference. TensorFlow code also from here. See Chapter 11.]

  • Ian Goodfellow et al. "Deep Learning." MIT Press (2016).

  • Ashia C. Wilson et al. "The Marginal Value of Adaptive Gradient Methods in Machine Learning." NIPS (2017).

  • Djork-Arne Clevert et al. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." ICLR (2016).

  • Günter Klambauer et al. "Self-Normalizing Neural Networks." NIPS (2017).

  • Glorot, Xavier and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks." AISTATS (2010).

Bonus Slides: Reusing Pretrained Layers

Model Zoos:

Ideas:

  • Freezing+Caching Lower Layers
  • Tweaking, Dropping, or Replacing the Upper Layers
  • Pretraining on an Auxiliary Task
In [17]:
#Image from: "Hands-On Machine Learning [...]" A. Géron (coursebook)
Image(filename = "transfer.png", width = 550)
Out[17]:

TensorNets Example

In [37]:
#Code from: https://github.com/taehoonlee/tensornets

import tensorflow as tf
import tensornets as nets

inputs = tf.placeholder(tf.float32, [None, 416, 416, 3])
model = nets.YOLOv2(inputs, nets.Darknet19)

img = nets.utils.load_img('cat.png')

with tf.Session() as sess:
    sess.run(model.pretrained())
    preds = sess.run(model, {inputs: model.preprocess(img)})
    boxes = model.get_boxes(preds, img.shape[1:3])

TensorNets Example

In [41]:
#Code from: https://github.com/taehoonlee/tensornets

%matplotlib inline
from tensornets.datasets import voc
print("%s: %s" % (voc.classnames[7], boxes[7][0]))  # 7 is cat

import numpy as np
import matplotlib.pyplot as plt
box = boxes[7][0]
plt.imshow(img[0].astype(np.uint8))
plt.gca().add_patch(plt.Rectangle(
    (box[0], box[1]), box[2] - box[0], box[3] - box[1],
    fill=False, edgecolor='r', linewidth=2))
plt.show()
cat: [103.         15.        356.        267.          0.9605811]