These are the notes that I wrote while attending the first three courses of the Deep Learning Specialization program at Coursera. If you have attended (or are attending) the Deep Learning Specialization program, these notes will look familiar to you straight away.

So here it goes.

~

# Representation and Notations

These are the conventions used throughout the course.

L

Number of layers in the network. In the example diagram above, there are four layers in the network (L=4), of which three are hidden layers. By convention the input layer (X) is not counted as a layer.

l

Layer indices. For example “l=1” means the first layer, “l=L” means the last layer or the output layer. As an additional convention, l=0 refers to the input layer.

m

Number of (training) samples.

_[l]

Superscript square bracket indicates the layer number. For example, a[2] means the activation values (a vector) for the second layer.

_(i)

Superscript round bracket indicates the sample number. For example, x(i) means the i-th input sample (which itself is a vector)

_i

Subscript index indicates the feature number or unit number. For example, $x_1$ indicates the first input feature, $a_2$ indicates the activation of the second unit in a particular layer, etc.

n[l]

n denotes the number of units, and square bracket superscript is used to indicate the layer number. Hence n[0] means the number of units in the input layer (three in the example above),  n[2] means the number of units in layer 2 (five in the diagram), and so on.

x

The input vector containing one training sample:

$x = \left[\begin{array}{cc}x_1\\ x_2\\..\\x_{n^{[0]}} \end{array}\right]$

X

X is a matrix containing all input samples. The dimension is nx by m, where nx is the number of features (technically it’s more correct to call this n[0], but the prof is not too consistent about this either) and m is the number of samples that we have.

It will look like this:

$X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \cdots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & \cdots & x_2^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n_x}^{(1)} & x_{n_x}^{(2)} & \cdots & x_{n_x}^{(m)} \end{bmatrix}$

a[l]

Activation units in layer l. For one particular layer, this is a column vector of n[l] dimension:

$a^{[l]} = \left[\begin{array}{cc}a_1\\ a_2\\..\\a_{n^{[l]}} \end{array}\right]$

A[l]

For vectorized implementation across all samples, A[l] represents the activation values for all units in layer l across all samples:

$A^{[l]} = \begin{bmatrix} a_1^{(1)} & a_1^{(2)} & \cdots & a_1^{(m)} \\ a_2^{(1)} & a_2^{(2)} & \cdots & a_2^{(m)} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n^{[l]}}^{(1)} & a_{n^{[l]}}^{(2)} & \cdots & a_{n^{[l]}}^{(m)} \end{bmatrix}$

z[l]

Logits in layer l. The dimension is the same as a[l].

Z[l]

Vectorized data of Z across all samples. Similar format and dimension to A[l].

b[l]

Bias in layer l. The dimension is the same as a[l].

W[l]

Matrix of weights in layer l, whose dimension is n[l] by n[l-1].

~

# Forward Propagation

## Initialization

Initialize W with small random values. Initialize b with zeros.

There are more advanced initialization techniques in Weight Initialization section.

## Forward Propagation

For each layer l, the vectorized forward propagation formulas are:

$Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}$

$A^{[l]} = g^{[l]} ( Z^{[l]} )$

where g[l](z) is the activation function for layer l. See below.
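The initialization and forward propagation formulas above can be sketched in NumPy as follows. This is only a minimal sketch: the function names (`init_params`, `forward`), the dictionary layout, and the choice of ReLU for hidden layers with sigmoid for the output layer are my own assumptions, not something prescribed by the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def init_params(layer_sizes, seed=0):
    """layer_sizes = [n0, n1, ..., nL]. W[l] is (n[l], n[l-1]), b[l] is (n[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        # small random values for W, zeros for b
        params["W" + str(l)] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
        params["b" + str(l)] = np.zeros((layer_sizes[l], 1))
    return params

def forward(X, params):
    """Vectorized forward pass over all m samples (columns of X)."""
    L = len(params) // 2
    A = X
    cache = {"A0": X}
    for l in range(1, L + 1):
        Z = params["W" + str(l)] @ A + params["b" + str(l)]  # b broadcasts over columns
        A = sigmoid(Z) if l == L else relu(Z)                # assumed activation choice
        cache["Z" + str(l)] = Z
        cache["A" + str(l)] = A
    return A, cache

X = np.random.default_rng(1).standard_normal((3, 5))  # n[0] = 3 features, m = 5 samples
params = init_params([3, 4, 2, 1])
AL, cache = forward(X, params)
```

The cache keeps every Z[l] and A[l] because backward propagation needs them later.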

## Activation Functions

At each node, an activation function is used to turn Z, which is a linear value, into A, a non-linear value. Without activation functions, the whole neural network no matter how large or deep would just be a silly linear function!

Each layer can use different activation functions, hence the activation function is indexed by the layer number (the g[l](z) notation).

### Sigmoid

$a = g(z) = \frac{1}{1 + e^{-z}}$

$g'(z) = g(z) (1 - g(z)) = a(1-a)$

Sigmoid activation function is typically used as the activation function of the output layer where the output is binary classification (1 and 0), because g(z) natively outputs values between 0 and 1.

For other layers, ReLU or tanh are preferred (see below), and for multiclass output, softmax is used (also see below).

### tanh

$a = g(z) = tanh(z)= \frac{e^z - e ^ {-z}}{e^z + e ^ {-z}}$

$g'(z) = 1 - (tanh(z))^2 = 1-a^2$

### ReLU (Rectified Linear Unit)

One of the drawbacks of the sigmoid and tanh functions is that the gradient becomes very small when Z is a large positive or large negative value, which makes the descent slow. The ReLU activation function mitigates this, since its gradient stays at 1 for all positive z.

$g(z) = max(0, z)$

$g'(z) = \begin{cases}0 & \text{if z} < 0\\1 & \text{if z} \geq 0\end{cases}$

### Leaky ReLU

$g(z) = max(0.01z, z)$

$g'(z) = \begin{cases}0.01 & \text{if z} < 0\\1 & \text{if z} \geq 0\end{cases}$
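As a sanity check, the four activation functions and their derivatives can be written in NumPy and compared against a numerical slope. This is a sketch with function names of my own choosing; the test points deliberately avoid z = 0, where ReLU has a kink.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    a = sigmoid(z)
    return a * (1.0 - a)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    return (z >= 0).astype(float)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

def leaky_relu_prime(z, slope=0.01):
    return np.where(z >= 0, 1.0, slope)

def numeric_prime(g, z, eps=1e-6):
    # two-sided difference, the same idea as in gradient checking
    return (g(z + eps) - g(z - eps)) / (2 * eps)

z = np.array([-3.0, -0.5, 0.7, 2.0])
checks = {
    "sigmoid": np.allclose(sigmoid_prime(z), numeric_prime(sigmoid, z), atol=1e-6),
    "tanh": np.allclose(tanh_prime(z), numeric_prime(np.tanh, z), atol=1e-6),
    "relu": np.allclose(relu_prime(z), numeric_prime(relu, z), atol=1e-6),
    "leaky_relu": np.allclose(leaky_relu_prime(z), numeric_prime(leaky_relu, z), atol=1e-6),
}
```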

## Cost Function

### Loss Function

The loss function for logistic regression for binary classification measures how well we’re doing with respect to single training example:

$L(a,y) = -y \log a - (1-y) \log (1-a)$

### Cost Function

The cost function measures how well we’re doing over the whole training set:

$J(w,b) = \frac{1}{m} \sum_{i=1}^m L(a^{(i)}, y^{(i)})$
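A minimal NumPy sketch of the loss and cost formulas above. The function name and the clipping constant `eps` are my own assumptions; clipping is only there to avoid taking log(0).

```python
import numpy as np

def binary_cross_entropy_cost(A, Y, eps=1e-12):
    """A and Y have shape (1, m): predictions and 0/1 labels."""
    m = Y.shape[1]
    A = np.clip(A, eps, 1 - eps)                       # assumption: guard against log(0)
    losses = -Y * np.log(A) - (1 - Y) * np.log(1 - A)  # L(a, y) for each sample
    return float(np.sum(losses) / m)                   # J = average loss

A = np.array([[0.9, 0.2, 0.5]])
Y = np.array([[1, 0, 1]])
cost = binary_cross_entropy_cost(A, Y)
```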

~

# Backward Propagation

## Initialization

For logistic regression/classification problems (binary or multiclass), we calculate dZ[L] as:

$\begin{array} {lcl} dZ^{[L]} & = & \hat{Y}- Y \\ & =& A^{[L]} - Y \end{array}$

For those who are interested in how dZ turns out like that, see the calculus notes below.

## Backprop for Layer l

Calculate the following for each layer, starting with layer L down to layer 1.

For layer L, calculate dZ as explained in Initialization section above. For other layers,  calculate dZ as:

$dZ^{[l]} = W^{[l+1]}{^{\top}} \cdot dZ^{[l+1]} * g^{[l]}{'}(Z^{[l]})$

Or if you prefer, we can calculate dZ in two steps, by first calculating dA:

$dA^{[l]} = W^{[l+1]}{^{\top}} \cdot dZ^{[l+1]}$

And then dZ:

$dZ^{[l]} = dA^{[l]} * g^{[l]}{'}(Z^{[l]})$

Once we have dZ, we can calculate the gradient of the parameters and bias as follows:

$dW^{[l]} = \frac{1}{m} dZ^{[l]} \cdot A^{[l-1]}{^{\top}}$

$db^{[l]} = \frac{1}{m} np.sum( dZ^{[l]}, axis=1, keepdims=1 )$
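To see the formulas in one place, here is a sketch of one full forward and backward pass for a small two-layer network (tanh hidden layer, sigmoid output). The function name and the network shape are my own assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backward_two_layer(X, Y, W1, b1, W2, b2):
    m = X.shape[1]
    # forward pass
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    # output layer: dZ[L] = A[L] - Y
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    # hidden layer: dA first, then dZ = dA * g'(Z), with tanh'(z) = 1 - a^2
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (1 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 6))                  # 2 features, 6 samples
Y = (rng.random((1, 6)) > 0.5).astype(float)
W1, b1 = rng.standard_normal((3, 2)) * 0.01, np.zeros((3, 1))
W2, b2 = rng.standard_normal((1, 3)) * 0.01, np.zeros((1, 1))
dW1, db1, dW2, db2 = backward_two_layer(X, Y, W1, b1, W2, b2)
```

Each gradient has the same shape as the parameter it belongs to, which is a handy check when debugging.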

### Calculus Notes

For each variable dz, da, dW, and db, dx means the derivative of the loss function L with regard to x. So:

• dz means the derivative of the loss function L with regard to z, i.e. $\frac{\text{d}L}{\text{d}z}$
• da means the derivative of the loss function L with regard to a, i.e. $\frac{\text{d}L}{\text{d}a}$
• and so on.

Note that the capital letter version (dZ, dA) just means that it is the vectorized version (for all samples).

The loss function L(a,y) for binary classification is defined in the Loss Function section above, and for multi-class classification it is defined differently in the Multi-class Classification section below.

From L(a, y) for binary classification, we calculate da as follows.

$\begin{array}{{lcl}}da & =& \frac{\text{d}}{\text{d}a} L(a,y)\\ & = & \frac{\text{d}}{\text{d}a} (-y \log a - (1-y)\log(1-a)) \\ & = & -\frac{y}{a} + \frac{1-y}{1-a} \end{array}$

dz is calculated as:

$\begin{array} {lcl} dz & = & \frac{\partial L}{\partial z} \\ & = & \frac{\partial L}{\partial a} \cdot \frac{\text{d}a}{\text{d}z} \ \ \text{(note: chain rule)} \\ & = & \mathit{da} \ \cdot \frac{\text{d}}{\text{d}z}g(z) \\ & = & \mathit{da} \ \cdot g'(z) \\ & = & (- \frac{y}{a} + \frac{1-y}{1-a}) \ \cdot (a(1-a))\\ & = & (- \frac{y.a(1-a)}{a} + \frac{(1-y).a(1-a)}{1-a}) \\ & = & (- y(1-a) + (1-y).a) \\ & = & -y + ya + a - ya \\ & = & a - y \end{array}$

dW is calculated as:

$\begin{array}{{lcl}}dW & =& \frac{\text{d}}{\text{d}W} L(a,y)\\ & = & \frac{\text{d}L}{\text{d}z} \cdot \frac{\text{d}z}{\text{d}W} \\ & = & dz \cdot \frac{\text{d}}{\text{d}W} (W \cdot a + b) \\ & = & dz \cdot a \end{array}$

Note that there is a subtlety in the dW = dz · a formula above: a is the activation vector of the previous layer, hence the exact formula with layer index is:

$dW^{[l]} = dz^{[l]} \cdot a^{[l-1]}$

Similarly for db:

$\begin{array}{{lcl}}db & =& \frac{\text{d}}{\text{d}b} L(a,y)\\ & = & \frac{\text{d}L}{\text{d}z} \cdot \frac{\text{d}z}{\text{d}b} \\ & = & dz \cdot \frac{\text{d}}{\text{d}b} (W \cdot a + b) \\ & = & dz \cdot 1 \\ & = & dz \end{array}$

## Updating the Parameters

Once we get dW and db we can update the parameters:

$W^{[l]} = W^{[l]} - \alpha \ dW^{[l]}$

$b^{[l]} = b^{[l]} - \alpha \ db^{[l]}$

The learning rate α is a hyperparameter whose value needs to be tuned. This will be explained in the Tuning section below.

~

# Multi-class Classification

With multi-class classification, the output layer now has more than one output. Suppose we have three classes:

• class 1: cat
• class 2: bird
• class 3: other

The neural network may look like this. Just notice that the output layer has three outputs:

Note:

Do not confuse multi-class classification and multi-label classification. In multi-class classification, we have multiple classes and the model predicts the one class that is the most likely representation of the input. In multi-label classification, we also have multiple classes, but now the model can predict more than one class that may be present in the input (e.g. an image may contain both cat and bird).

The multi-label model is also called multi-task learning.

## Forward Propagation

For the training labels in a multiclass classification setup, we need to convert y (a scalar number) into a one-hot vector y. For example, for the three classes above, each value is converted into a vector as follows:

$y=[1] \ \text{ becomes } \ y = \left[\begin{array}{cc}1\\ 0\\ 0 \end{array}\right]$

$y=[2] \ \text{ becomes } \ y = \left[\begin{array}{cc}0\\ 1\\ 0 \end{array}\right]$

$y=[3] \ \text{ becomes } \ y = \left[\begin{array}{cc}0\\ 0\\ 1 \end{array}\right]$

For forward propagation, calculate Z as usual:

$Z^{[l]} = W^{[l]} \cdot A^{[l-1]} + b^{[l]}$

Then calculate vector t as:

$t = e^{(z^{[L]})}$

Visually, t look like this (for a three class output):

$t = \left[\begin{array}{cc}t_1\\ t_2\\ t_3 \end{array}\right] = \left[\begin{array}{cc}e^{(z_1)}\\ e^{(z_2)}\\ e^{(z_3)} \end{array}\right]$

With this we can calculate the activation vector a[L] as:

$a^{[L]} = \frac {t}{ sum(t) }$

Note that the above is a vector operation (t is a vector).

Visually, a looks like this (for a three class output):

$a^{[L]} = \left[\begin{array}{cc} t_1 / (t_1+t_2+t_3)\\ t_2 / (t_1+t_2+t_3)\\ t_3 / (t_1+t_2+t_3) \end{array}\right]$

And once we’re done with the calculation, each value corresponds to the probability that the input is the same class as the unit. Note that the sum of the probabilities is 1. So we may end up with values like this:

$a^{[L]} = \left[\begin{array}{cc} 0.2 \\ 0.5 \\ 0.3 \end{array}\right]$

which means that the most likely output is class 2 since it has the biggest probability.
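The one-hot conversion and the softmax activation can be sketched in NumPy like this. The function names are my own, and subtracting the column maximum before exponentiating is a common numerical-stability trick that is not part of the notes (it does not change the result).

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax; Z has shape (n_classes, m)."""
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # assumption: stability shift
    return T / np.sum(T, axis=0, keepdims=True)       # a = t / sum(t)

def one_hot(y, n_classes):
    """y holds class numbers 1..n_classes (1-based, as in the notes)."""
    Y = np.zeros((n_classes, y.size))
    Y[y - 1, np.arange(y.size)] = 1.0
    return Y

Z = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [1.5, 5.0]])              # 3 classes, 2 samples
A = softmax(Z)
pred = np.argmax(A, axis=0) + 1         # back to 1-based class numbers
Y = one_hot(np.array([2, 3]), 3)
```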

## Cost Function

### Softmax Loss Function

$L(\hat{y}, y) = - \sum_{j=1}^{n^{[L]}} y_j\ log \ \hat{y_j}$

### Softmax Cost Function

$J(W^{[1]}, b^{[1]}, ...) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$

## Backward Propagation

Since multiclass classification only differs from the binary classification NN in the last layer/output layer (i.e. the output has more than one node and uses the softmax activation function instead), only the first part of the gradient descent calculation is different. Specifically, the dZ[L] calculation is different.

Unfortunately the prof didn’t say how it should be calculated, because he said it will be done by the framework (Tensorflow). Will update once I have more info.
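In the meantime, a quick numerical experiment gives a hint. The sketch below (my own check, not from the course) compares the numerical gradient of the softmax loss with respect to z against the a - y expression used for dZ[L] in the Backward Propagation section, for one made-up sample.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def softmax_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.5, -1.0, 2.0])   # made-up logits for a single sample
y = np.array([0.0, 1.0, 0.0])    # one-hot label (class 2)

eps = 1e-6
numeric_dz = np.zeros_like(z)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    numeric_dz[j] = (softmax_loss(z + step, y) - softmax_loss(z - step, y)) / (2 * eps)

analytic_dz = softmax(z) - y     # candidate formula: a - y
```

For this sample the two agree closely, which is consistent with using dZ[L] = A[L] - Y for softmax as well.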

~

# Tuning

## Optimizing Bias vs Variance

High bias means the model underfits the data. High variance means the model overfits the data.

We need to monitor our training set error vs dev set error. Establish the optimal error, which is also called the Bayes error, which is the theoretical best error that can be achieved. Alternatively, since establishing Bayes error is difficult, we can also compare with human level performance.

Don’t just look at the raw error numbers, but rather compare them with the baseline error (either Bayes or human level error). The difference between training error and the baseline is called “avoidable bias”. The difference between dev error and training error is called the “variance“.

Here are some examples:

• suppose the baseline error is 1%
• if training error is 7% and dev set error is 10%, that means the avoidable bias is 6% and the variance is 3%. We should fix the bias since it is larger.
• on the other hand, suppose the baseline error is 6% for the same problem:
• this means the avoidable bias is 1% and the variance is (still) 3%. We should optimize the variance instead.
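The comparison in the examples above is simple enough to write as a tiny helper. The function name and the tie-breaking rule (focus on variance when the two gaps are equal) are my own assumptions; errors are given in percent.

```python
def diagnose(baseline_err, train_err, dev_err):
    """Return (avoidable bias, variance, which one to work on first)."""
    avoidable_bias = train_err - baseline_err
    variance = dev_err - train_err
    focus = "bias" if avoidable_bias > variance else "variance"
    return avoidable_bias, variance, focus

first = diagnose(baseline_err=1, train_err=7, dev_err=10)
second = diagnose(baseline_err=6, train_err=7, dev_err=10)
```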

The basic recipe for training or optimizing the model is as follows:

1. Does it have high bias? If yes, fix it (always fix a high bias problem first):
    1. Train a bigger model.
    2. Train longer or use better optimization algorithms (momentum, RMSprop, Adam, etc.).
    3. Maybe use a different NN architecture or hyperparameter search.
2. Once the bias is good, fix any high variance issues:
    1. Get more data, because training on more data helps the model generalize better.
    2. Use regularization (L2, dropout, data augmentation, etc.).
    3. Maybe use a different NN architecture or hyperparameter search.

## Regularization

Regularization is a way to fix high variance (overfit) problem. The intuition for regularization is it penalizes W for being too large, and this makes the network simpler.

### Cost Function

For neural network, the formula to add regularization to the cost function is:

$J(W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}) = (\text{the original cost function}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \Vert W^{[l]} \Vert _F ^ 2$

or:

$J(W^{[1]},b^{[1]},...,W^{[L]},b^{[L]}) = \frac{1}{m} \sum_{i=1}^m L(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l=1}^{L} \Vert W^{[l]} \Vert _F ^ 2$

where $\Vert W^{[l]} \Vert _F ^ 2$ is the squared Frobenius norm:

$\Vert W^{[l]} \Vert _F ^ 2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^2$

Practically, this simply adds up the squares of all the weights in the network:

$reg = \frac{\lambda}{2m} \sum_{l=1}^{L} \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} (W_{ij}^{[l]})^2$

Note that the bias is omitted from the regularization, but you can include it if you want, by adding:

$\frac{\lambda}{2m} \sum_{l=1}^{L} \sum_{i=1}^{n^{[l]}} (b_{i}^{[l]})^2$

During backpropagation, the regularization term adds an extra term to the gradient dW:

$dW^{[l]} = \text{(from backprop)} + \frac{\lambda}{m} W^{[l]}$

The update formula is the same:

$W^{[l]} = W^{[l]} - \alpha \ dW^{[l]}$

L2 regularization is also called “weight decay” because, in the update formula above, it “decays” W by a factor of $(1-\frac{\alpha \lambda}{m})$ on every update.
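The regularization term and the “weight decay” view can be checked with a few lines of NumPy. This is a sketch with made-up numbers; the helper names and the variable `lambd` (since `lambda` is a reserved word in Python) are my own.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    """(lambda / 2m) * sum of squared weights over all layers."""
    return lambd / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l2_update(W, dW_backprop, lambd, m, alpha):
    dW = dW_backprop + lambd / m * W   # gradient with the regularization term
    return W - alpha * dW

W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW_backprop = np.array([[0.1, 0.2], [-0.3, 0.4]])
lambd, m, alpha = 0.7, 10, 0.1

W_new = l2_update(W, dW_backprop, lambd, m, alpha)
# the same update written as "decay W first, then apply the plain gradient step"
W_decay = (1 - alpha * lambd / m) * W - alpha * dW_backprop
penalty = l2_penalty([W], lambd, m)
```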

## Dropout Regularization

The idea is to drop units randomly during training (but not during test time). The intuition is that by dropping the nodes it makes the subsequent nodes less dependent on that node; nodes can’t rely on specific nodes because they may be dropped at any time.

We define a probability ratio called keep_prob to indicate the probability that a unit is kept. The keep_prob can be set differently for each layer, for example, bigger layer can be set to have more probability for being dropped than smaller layer. Then randomly “drop” units during the iteration with probability (1 – keep_prob). We drop different units on different iteration/mini-batch.

### Forward Propagation

For example we’re working on layer 3 (this is reflected in variable names).

1) Define the keep_prob, say 80% probability:

keep_prob = 0.8

2) Create array containing “mask” to drop units. Note that the “< keep_prob” part of the expression will convert array of probability values into array of booleans (True or False).

d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob

3) Drop the units. In this expression, Numpy treats these booleans as 0 or 1, effectively zeroing (“dropping”) the a3 nodes when the corresponding d3 nodes are zero.

 a3 = a3 * d3

4) Scale the value back to the original scale. This is called “inverted dropout“:

a3 /= keep_prob

We can also merge step 4 into step 2 to make the process slightly  shorter:

d3 = (np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob) / keep_prob
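Putting the steps together, a hedged sketch of inverted dropout as a function might look like this. The name `dropout_forward` and the `training` flag are my own; the flag is just one way to make sure nothing is dropped at test time.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng, training=True):
    """Inverted dropout on activations A. Returns (A_out, mask)."""
    if not training:
        return A, None                      # no dropout at test time
    mask = rng.random(A.shape) < keep_prob  # True = keep the unit
    return A * mask / keep_prob, mask       # drop, then scale back up

rng = np.random.default_rng(0)
a3 = np.ones((50, 200))
a3_train, d3 = dropout_forward(a3, 0.8, rng)
a3_test, _ = dropout_forward(a3, 0.8, rng, training=False)
```

Because of the division by keep_prob, the average activation stays roughly the same as without dropout, which is why no extra scaling is needed at test time.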

### Backward Propagation

It’s not explained in the course. Need to trace the programming exercise. TBD.

### Test Time

Do not drop units during test time! Dropout is only done during training.

## Data Augmentation

Another way to fix overfitting is to add more data. If this is not possible (or is too expensive), we can augment the existing data to make new data, but of course this is not as good as getting fresh data.

## Early Stopping

Plot cost function from training set vs dev set, and stop the training if we detect that we are starting to overfit the training data.

A criticism of the early stopping method is that it mixes two processes which should otherwise be independent: optimizing the cost function and not overfitting. Optimizing the cost function should be the primary objective of our training, and stopping early interferes with this objective. The alternative is to use regularization.

## Normalizing Input

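Normalizing the input (x) puts the features on the same scale, and this makes the training go faster.

$\mu = \frac{1}{m} \sum_{i=1}^m x^{(i)}$

$\sigma^2 = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu)^2$

$x = \frac{x - \mu}{\sigma}$

Note: the square is element-wise, and the division is by the standard deviation σ (i.e. the square root of the variance), not by the variance itself, so that each feature ends up with unit variance. Use the same μ and σ computed from the training set to normalize the dev and test sets.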
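A small NumPy sketch of input normalization. Note that this version divides by the standard deviation (the square root of the variance), which is what gives each feature unit variance; the function names and the small `eps` guard against constant features are my own assumptions.

```python
import numpy as np

def fit_normalizer(X):
    """X has shape (n_features, m). Returns per-feature mean and standard deviation."""
    mu = np.mean(X, axis=1, keepdims=True)
    sigma = np.std(X, axis=1, keepdims=True)
    return mu, sigma

def normalize(X, mu, sigma, eps=1e-8):
    return (X - mu) / (sigma + eps)   # eps avoids division by zero

X_train = np.array([[1.0, 2.0, 3.0, 4.0],
                    [100.0, 300.0, 500.0, 700.0]])
mu, sigma = fit_normalizer(X_train)
X_norm = normalize(X_train, mu, sigma)
```

The same `mu` and `sigma` from the training set would then be reused for the dev and test sets.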

## Weight Initialization

Large weight values make Z large, which slows down gradient descent, because with sigmoid or tanh activation functions, the gradient of the activation function becomes very small as Z becomes very large (positive or negative).

More units in the layer and more layers in the network make this problem worse too. This is known as the vanishing/exploding gradient problem: a weight that is >1 gets multiplied roughly as many times as there are layers, making the value “explode”, and similarly a weight that is <1 makes it “vanish”. So choosing good initial weight values is desirable.

### Typical Initialization

One typical weight initialization is to draw the weights from a standard normal distribution and scale them down by a small constant:

W[l] = np.random.randn(n[l], n[l-1]) * 0.01

### Divide by Variance

A reasonable implementation is to scale the weights so that their variance is 1/n[l-1]:

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt( 1 / n[l-1] )

Where n[l-1] indicates the number of units in the previous layer. This is used especially when the activation function is tanh, and this method is called Xavier initialization.

If ReLU activation is used, a variance of 2/n[l-1] is said to work better (based on a paper by He et al.):

W[l] = np.random.randn(n[l], n[l-1]) * np.sqrt( 2 / n[l-1] )
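The three initialization schemes can be wrapped in one helper, sketched below. The function name and the `method` strings are hypothetical. One assumption to flag: this sketch draws from a standard normal distribution (zero mean, so weights can be negative) rather than from a uniform distribution on [0, 1).

```python
import numpy as np

def init_weights(n_prev, n_curr, method="he", rng=None):
    """Return W of shape (n_curr, n_prev) scaled according to `method`."""
    if rng is None:
        rng = np.random.default_rng(0)
    scale = {
        "small": 0.01,                     # plain small constant
        "xavier": np.sqrt(1.0 / n_prev),   # suggested for tanh
        "he": np.sqrt(2.0 / n_prev),       # suggested for ReLU
    }[method]
    return rng.standard_normal((n_curr, n_prev)) * scale

W_he = init_weights(500, 400, "he")
W_xavier = init_weights(500, 400, "xavier")
```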

## Gradient Checking

Gradient checking is used to double-check that our gradient calculation is correct. Use it sparingly as it is very slow.

For each parameter θi, calculate the approximate gradient (remember that gradient is the slope of the function):

$d\theta_{approx} = \frac{J(\theta_i + \epsilon) - J(\theta_i - \epsilon)}{2 \epsilon}$

Epsilon is a very small value, say 1e-7.

Then calculate the difference with the actual gradient calculated during gradient descent:

$\mathit{diff} = \frac{ \Vert d\theta_{approx} - d\theta \Vert _2 }{ \Vert d\theta_{approx} \Vert _2 + \Vert d\theta \Vert _2}$

For a small epsilon (~1e-7), the difference is expected to be small (< 1e-7).

The algorithm to do this with Numpy is as follows.

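for each i:
    Jplus = J(θi+ε)
    Jminus = J(θi-ε)
    gradapprox[i] = (Jplus - Jminus) / (2*ε)

numerator = np.linalg.norm(gradapprox - grad)
denominator = np.linalg.norm(gradapprox) + np.linalg.norm(grad)
diff = numerator / denominator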
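Here is a runnable sketch of the same procedure on a toy cost function where the true gradient is known. The function `gradient_check`, the toy `J`, and its gradient `dJ` are all made up for illustration; in a real network, θ would be all the W and b values flattened into one vector.

```python
import numpy as np

def gradient_check(J, grad, theta, eps=1e-7):
    """Relative difference between the numerical gradient of J and `grad` at theta."""
    gradapprox = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                       # nudge only parameter i
        minus[i] -= eps
        gradapprox[i] = (J(plus) - J(minus)) / (2 * eps)
    numerator = np.linalg.norm(gradapprox - grad)
    denominator = np.linalg.norm(gradapprox) + np.linalg.norm(grad)
    return numerator / denominator

def J(theta):
    return np.sum(theta ** 2) + np.sum(np.sin(theta))

def dJ(theta):
    return 2 * theta + np.cos(theta)

theta = np.array([0.3, -1.2, 2.0])
diff_ok = gradient_check(J, dJ(theta), theta)
diff_bad = gradient_check(J, 2 * theta, theta)   # deliberately wrong gradient
```

A correct gradient gives a tiny difference, while the deliberately broken one gives a difference that is orders of magnitude larger.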

## Mini-batch Gradient Descent

Divide the training examples into mini-batches of say 64, 128, 512, 1000 samples each, and run one iteration on each of these mini-batches instead of on the whole training set. If we have a very large number of samples this should make the training converge faster, because otherwise (with the batch method) the weights are only updated once we have processed the whole training set.

At another extreme, we can train one sample at a time (called stochastic gradient descent), but this would be too slow as it defeats the benefit of vectorization.
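Splitting the data into mini-batches can be sketched as follows. The function name is hypothetical, and shuffling the samples before splitting is my own assumption (it is a common practice so that each epoch sees different batches).

```python
import numpy as np

def random_mini_batches(X, Y, batch_size=64, seed=0):
    """X is (n_features, m), Y is (1, m). Returns a list of (X_batch, Y_batch)."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]      # same permutation keeps pairs aligned
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]    # last batch may be smaller

X = np.arange(2 * 150, dtype=float).reshape(2, 150)
Y = np.arange(150).reshape(1, 150)
batches = random_mini_batches(X, Y, batch_size=64)
sizes = [bx.shape[1] for bx, _ in batches]
```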

## Gradient Descent with Momentum

With this method, we update W (and b) with the exponentially weighted running average of dW (and db) instead. This works faster than standard gradient descent. The intuition is that averaging the values smooths out the oscillation in the gradient descent: say the dW value of one particular feature is +10 in one iteration and -10 in the next, then the average is 0, which cancels out the oscillation on that particular feature, and the descent will “focus” on the right direction towards the minimum.

Here VdW is an exponentially weighted moving average of dW. It’s initialized to 0 and updated on every iteration:

$V_{dW} = \beta V_{dW} + (1 - \beta) dW$

Similarly for Vdb:

$V_{db} = \beta V_{db} + (1 - \beta) db$

The hyperparameter β is usually 0.9.

The parameters are then updated with VdW and Vdb:

$W = W - \alpha \ V_{dW}$

$b = b - \alpha \ V_{db}$

Optionally, bias correction may be applied to fix the first few values of the moving average:

$V_{dW} = \frac{V_{dW}}{1 - \beta^t}$

$V_{db} = \frac{V_{db}}{1 - \beta^t}$
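Two small experiments illustrate the formulas above. This is a sketch with made-up gradients; `momentum_step` is my own helper name. The first loop feeds a zig-zag gradient (+10, -10, ...) and shows that the moving average stays small. The second feeds a constant gradient and shows that bias correction recovers the true value from the very first step.

```python
def momentum_step(w, v, dw, alpha=0.1, beta=0.9):
    v = beta * v + (1 - beta) * dw   # exponentially weighted average of dw
    return w - alpha * v, v

# 1) oscillating gradient: the average largely cancels out
v, zigzag = 0.0, []
for t in range(1, 21):
    dw = 10.0 if t % 2 == 1 else -10.0
    _, v = momentum_step(0.0, v, dw)
    zigzag.append(v)

# 2) constant gradient: bias correction fixes the slow warm-up of v
v, corrected = 0.0, []
for t in range(1, 6):
    _, v = momentum_step(0.0, v, 10.0)
    corrected.append(v / (1 - 0.9 ** t))
```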

## RMSprop (Root Mean Square Prop)

The intuition is to dampen large oscillations in particular directions by penalizing movements that are large (dividing each gradient by the square root of the running average of its square). The formulas are:

$S_{dW} = \beta S_{dW} + (1 - \beta) (dW)^2$

$S_{db} = \beta S_{db} + (1 - \beta) (db)^2$

$W = W - \alpha \frac{dW}{\sqrt{S_{dW}} + \epsilon}$

$b = b - \alpha \frac{db}{\sqrt{S_{db}} + \epsilon}$

where β is typically 0.999 and epsilon ϵ is very small (e.g. $10^{-8}$) and is used to avoid division by zero.
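A sketch of one RMSprop step in NumPy, with a made-up gradient whose two components differ by four orders of magnitude. The function name is my own.

```python
import numpy as np

def rmsprop_step(w, s, dw, alpha=0.01, beta=0.999, eps=1e-8):
    s = beta * s + (1 - beta) * dw ** 2          # running average of squared gradient
    w = w - alpha * dw / (np.sqrt(s) + eps)      # large gradients get divided by more
    return w, s

w = np.array([1.0, 1.0])
s = np.zeros_like(w)
dw = np.array([100.0, 0.01])                     # one steep and one flat direction
w_new, s = rmsprop_step(w, s, dw)
step = w - w_new
```

Both directions end up moving by roughly the same amount, even though the raw gradients are very different. That is the dampening effect described above.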

## Adam (Adaptive Moment Estimation)

Adam combines momentum and RMSprop.

This is the momentum part with bias correction:

$V_{dW} = \beta_1 V_{dW} + (1 - \beta_1) dW$

$V_{dW}^{\mathit{corr}} = \frac{V_{dW}}{1 - \beta_1^t}$

$V_{db} = \beta_1 V_{db} + (1 - \beta_1) db$

$V_{db}^{\mathit{corr}} = \frac{V_{db}}{1 - \beta_1^t}$

This is the RMSprop part with bias correction:

$S_{dW} = \beta_2 S_{dW} + (1 - \beta_2) (dW)^2$

$S_{dW}^{\mathit{corr}} = \frac{S_{dW}}{1 - \beta_2^t}$

$S_{db} = \beta_2 S_{db} + (1 - \beta_2) (db)^2$

$S_{db}^{\mathit{corr}} = \frac{S_{db}}{1 - \beta_2^t}$

$W = W - \alpha \frac{V_{dW}^{\mathit{corr}} }{\sqrt{S_{dW}^{\mathit{corr}}} + \epsilon}$

$b = b - \alpha \frac{V_{db}^{\mathit{corr}} }{\sqrt{S_{db}^{\mathit{corr}}} + \epsilon}$

The typical hyperparameter values are β1 = 0.9, β2 = 0.999, learning rate α needs to be tuned, and epsilon ϵ is $10^{-8}$.
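The update above translates almost line by line into NumPy. This is a sketch: `adam_step` is my own helper name, and the toy cost J(w) = sum(w²) with gradient 2w is made up just to have something to minimize.

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * dw          # momentum part
    s = beta2 * s + (1 - beta2) * dw ** 2     # RMSprop part
    v_corr = v / (1 - beta1 ** t)             # bias corrections (t starts at 1)
    s_corr = s / (1 - beta2 ** t)
    w = w - alpha * v_corr / (np.sqrt(s_corr) + eps)
    return w, v, s

w = np.array([1.0, -2.0])
v, s = np.zeros_like(w), np.zeros_like(w)

w, v, s = adam_step(w, 2 * w, v, s, t=1, alpha=0.01)
w_after_first = w.copy()

for t in range(2, 1001):
    w, v, s = adam_step(w, 2 * w, v, s, t, alpha=0.01)
```

With bias correction, the very first step has size close to α in every direction, regardless of how large the gradient is.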

## Learning Rate Decay

Learning rate decay can be used to speed up learning but typically it’s not that important (lower down in priority list).

There are many ways to decay the learning rate. For example it can be calculated as follows, with α0=0.2 and decay_rate=1:

$\alpha = \frac{1}{1 + \mathit{decay\_rate} * \mathit{epoch\_num}} \ \alpha_0$

Another one called exponential decay:

$\alpha = 0.95^{\mathit{epoch\_num}} \ \alpha_0$

Another one:

$\alpha = \frac{k}{\sqrt{\mathit{epoch\_num}}} \ \alpha_0$

Another one (t is mini-batch number):

$\alpha = \frac{k}{\sqrt{\mathit{t}}} \ \alpha_0$

Or alternatively, do manual decay (manually decrease learning rate based on manual observation).
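The schedules above are one-liners in Python. The function names are my own, and the default values are the ones used in the formulas (k is left as a free constant).

```python
import math

def inverse_decay(alpha0, epoch_num, decay_rate=1.0):
    return alpha0 / (1 + decay_rate * epoch_num)

def exponential_decay(alpha0, epoch_num, base=0.95):
    return base ** epoch_num * alpha0

def sqrt_decay(alpha0, epoch_num, k=1.0):
    return k / math.sqrt(epoch_num) * alpha0   # epoch_num must be >= 1

# alpha0 = 0.2 and decay_rate = 1, for epochs 1 to 4
schedule = [inverse_decay(0.2, epoch) for epoch in range(1, 5)]
```

With these values the learning rate goes 0.1, 0.067, 0.05, 0.04 over the first four epochs.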

## Batch Normalization

The idea is to normalize the value of Z in any hidden layers so that the values don’t fluctuate much between iterations, and this makes the subsequent layers happier.

### Training Stage

For each hidden layer l in the network, for each iteration:

Calculate the mean:

$\mu = \frac{1}{m} \sum_{i=1}^{m} z^{(i)}$

where m can be the number of samples or the number of samples in a mini-batch. In fact batch normalization is probably used more with mini-batches.

Then calculate the variance:

$\sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (z^{(i)} - \mu)^2$

And normalize z:

$z_{\mathit{norm}}^{(i)} = \frac {z^{(i)} - \mu}{ \sqrt{\sigma^2 + \epsilon}}$

By now we have z whose values have mean 0 and variance 1. But we don’t always want every hidden layer to have exactly this distribution, so we modify them as follows:

$\widetilde{z}^{(i)} = \gamma \ z_{norm}^{(i)} + \beta$

where γ and β are learnable parameters.
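The training-stage steps can be sketched in NumPy as follows. The function name and the example values for γ and β are made up; in practice γ and β would be learned together with W.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-8):
    """Z has shape (n_units, m). gamma and beta have shape (n_units, 1)."""
    mu = np.mean(Z, axis=1, keepdims=True)      # per-unit mean over the mini-batch
    var = np.var(Z, axis=1, keepdims=True)      # per-unit variance
    Z_norm = (Z - mu) / np.sqrt(var + eps)      # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta             # learnable scale and shift
    return Z_tilde, Z_norm, mu, var

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 64)) * 5.0 + 3.0    # 4 units, mini-batch of 64
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 0.5)
Z_tilde, Z_norm, mu, var = batchnorm_forward(Z, gamma, beta)
```

After the last step, each unit has mean β and standard deviation γ, whatever the original scale of Z was.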

### Test Time

Since we usually predict one sample at a time at test time, the mean and variance cannot be computed from a mini-batch, so they need to be estimated separately.

Usually people estimate μ and σ² as the exponentially weighted average of the values computed on the mini-batches during training.

~

# Structuring Machine Learning Projects

## Splitting the Data

Split the samples into training set, development/dev set (a.k.a hold-out cross validation set), and test set. It’s important to have separate dev and test set. The dev set is used as the target for the optimization training. Once training is finished and the model is done, the test set is then used as the unbiased measurement for the model’s performance. Since we are iterating to get the best performance on the dev set, it is possible to overfit the dev set, and the test set is used to detect this. If there is overfit to the dev set, one way to fix is to have larger dev set.

Traditionally the data is split in 60:20:20 proportion, or 70:30 if we don’t have a test set. But in this big data era, if we have very big data (>1 million samples), maybe 1% (=10,000) each is enough for the dev and test sets.

When splitting the dataset, make sure the data in dev and test sets come from the same distribution. It is not so much of a problem when the training and dev set distributions are different. The problem is when dev and test set distributions are different. It’s like optimizing the model to hit a certain target (the dev set), and then use different target to test it.

## Test Set from Different Distribution

Suppose we’re building cat classifier for mobile app. We have 200,000 (high resolution) cat images downloaded from the web. We also have 10,000 images of cats taken by using mobile camera and uploaded by the users. Obviously these 10,000 images are the data that we care most since it represents the actual use case of the app. How to split the data effectively?

One way to do it is to merge both data randomly and split them into training, dev, and test sets normally as before. But this is a bad way to do it because our 10,000 images will have very little impact on the training and the test set will have little resemblance with the actual use case (we have >95% of high quality web images which do not exist in the real use case).

A better way to do it is to split the data as follows:

• training set consists of 200,000 images from the web + 5,000 images from users
• dev set consists of 2,500 images from users
• test set consists of 2,500 images from users

But then our training will have different distribution than the dev/test sets. The bias and variance analysis will need to change (see below).

## Bias and Variance with Mismatched Data Distribution

When our training set and dev set have different distribution, we may have the following results:

• training error: 1%
• dev error: 10%

It is difficult to conclude if the error difference is due to variance or because of different distribution. To fix this, add another set, training-dev set, which has the same distribution as the training set, but the model is not trained on this set.

Then we can analyze the errors more accurately.

For example, continuing with the previous example (where we have 10% dev error):

• case 1:
    • human level error: 0%
    • training error: 1%
    • training-dev error: 9%
    • dev error: 10%
    • With these errors, we can conclude that we have high variance.
• case 2:
    • human level error: 0%
    • training error: 1%
    • training-dev error: 1.5%
    • dev error: 10%
    • With these errors, we can conclude that the error is due to data mismatch.
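The reasoning in the two cases can be written as a small helper that looks at the three gaps and reports the largest one. The function name and the rule of “pick the largest gap” are my own simplification, and errors are in percent.

```python
def diagnose_with_train_dev(human, train, train_dev, dev):
    gaps = {
        "avoidable bias": train - human,       # training vs. human level
        "variance": train_dev - train,         # same distribution, unseen data
        "data mismatch": dev - train_dev,      # different distribution
    }
    return max(gaps, key=gaps.get), gaps

case1, gaps1 = diagnose_with_train_dev(human=0, train=1, train_dev=9, dev=10)
case2, gaps2 = diagnose_with_train_dev(human=0, train=1, train_dev=1.5, dev=10)
```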

## Fixing Data Mismatch

• Carry out manual error analysis to try to understand difference between training and dev/test sets
• Make the training data more similar to the dev/test sets:
    • collect more data similar to dev/test sets
    • synthesize training data, but be careful so that the training doesn’t overfit this synthesized data (for example, we may be able to synthesize thousands of car images from a game, but maybe there are only 20 types of cars in the game)

## Orthogonalization

When fixing a machine learning problem (e.g. fixing the bias, fixing the variance), pick a method that solves that specific problem, rather than a method that affects many aspects of the learning at the same time. Changing many things at once makes the results confusing to interpret.

## Single Evaluation Metric

If we have many evaluation metrics (e.g. precision and recall, results across many countries, etc.), it’s better to combine them into one single metric (e.g. F1 score, the weighted average, etc.) so that the results are easier to compare.

The metric must reflect the real world objective for the product (e.g. user satisfaction). If the metric no longer reflects that, it must be changed.

## Satisficing and Optimizing Metrics

A satisficing metric is a metric with a threshold that must not be violated. An optimizing metric is the metric that we try to make as good as possible.

For example, we want the classifier with the best accuracy whose run time cannot exceed 100ms. The 100ms restriction is the satisficing metric, while the accuracy is the optimizing metric.
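In code, this is a filter followed by a maximum. The candidate models and their numbers below are made up purely for illustration.

```python
# hypothetical candidates: (name, accuracy, run time in ms)
models = [
    {"name": "A", "accuracy": 0.90, "runtime_ms": 80},
    {"name": "B", "accuracy": 0.92, "runtime_ms": 95},
    {"name": "C", "accuracy": 0.95, "runtime_ms": 1500},
]

# satisficing metric: run time must not exceed 100 ms
candidates = [m for m in models if m["runtime_ms"] <= 100]

# optimizing metric: best accuracy among the models that pass
best = max(candidates, key=lambda m: m["accuracy"])
```

Model C has the best accuracy but fails the run time constraint, so B is picked.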

## Error Analysis

To improve accuracy, analyse the samples that are predicted wrongly by our model. Find the reason why they are wrongly predicted, and focus on fixing error that will improve the accuracy most significantly.

For example, we have 90% accuracy to our cat detection model, which is equal to 10% error. By doing error analysis, suppose we counted that 10% of this error is because the model is misclassifying dogs, 50% is because the image is blurry, 10% is because it is misclassifying big cats, and 30% is due to other reasons. By having this error analysis, we now know where we will make significant improvement to our accuracy.
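The arithmetic behind this is worth spelling out: each category’s share of the errors, times the overall error, gives the most that fixing that category could possibly gain. A sketch using the numbers from the example (the variable names are my own):

```python
overall_error = 10.0   # percent of all samples
shares = {"dogs": 10, "blurry": 50, "big cats": 10, "other": 30}   # percent of the errors

# upper bound on the error reduction (in percentage points) per category
ceilings = {cause: overall_error * share / 100 for cause, share in shares.items()}
best = max(ceilings, key=ceilings.get)
```

So fixing blurry images could reduce the error by at most 5 percentage points (from 10% to 5%), while fixing the dog misclassification could gain at most 1 point.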

## Incorrectly Labelled Data

During error analysis, if we find that we have many incorrectly labelled data, we may decide to fix those samples (depending on how much improvement we can expect by doing it). If we do so, make sure to also fix the test set, so that the dev and test sets continue to have the same distribution.

## Build Something Quickly, then Iterate

The general advice for building a new machine learning project is to set up the dev/test sets and metric, build the initial system quickly (don’t overthink it), and then iterate on it, using the bias/variance analysis and error analysis above to prioritize what to fix next.

For application areas in which we have significant prior experience, or where there is a significant body of academic literature for the exact same problem, it is okay to build a more complex system based on this literature.

## Transfer Learning

Sometimes it is possible to reuse a model trained to do one task to do another task. For example, reuse a model that has been trained to do cat detection to do radiology image classification.

This kind of reuse is called transfer learning. The way to do it is to delete the last layer of the network (and all the weights) and replace with a fresh new layer and new set of random weights.

If the new dataset is small, it’s best to only train the last layer and keep all the other layers fixed. If the new dataset is large, we can retrain the whole network. If we do the latter, then the training for the initial task (the cat detection) is called pre-training, while the training for the new task is called fine tuning.

Rough guidelines on when transfer learning (from task A to task B) makes sense:

• task A and B have the same input x
• we have a lot more data for task A than task B
• low level features from task A may be helpful for learning B