Batch Normalization

Understanding Batch Normalization with Codes Explained

What is Batch Normalization?

Normalization means rescaling the data dimensions so that they are of approximately the same scale. Batch normalization is a technique for training very deep neural networks that normalizes the contributions to a layer for every mini-batch. Batch Norm is an essential part of most deep learning implementations.

Batch Normalization normalizes the hidden units' activation values so that the distribution of these activations remains the same during training. It stabilizes the learning process and drastically reduces the number of training epochs required to train deep neural networks.

We will assume that X is of size [N x D], where N is the number of data points and D is their dimensionality. To compute the variance, we first subtract the mean of every individual feature across the batch; the variance is then the mean of the squared centered values.
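As a small self-contained sketch of this step (the values below are illustrative, not from the book), the per-feature statistics for an X of size [N x D] can be computed with NumPy like this:

```python
import numpy as np

# toy batch: N = 4 samples, D = 3 features with very different scales
X = np.array([[1.0, 200.0, -3.0],
              [2.0, 180.0, -1.0],
              [3.0, 220.0, -2.0],
              [4.0, 200.0, -2.0]])

mu = X.mean(axis=0)             # per-feature mean, shape (D,)
xc = X - mu                     # mean subtraction
var = np.mean(xc ** 2, axis=0)  # per-feature variance
xn = xc / np.sqrt(var + 1e-7)   # normalized: mean ~0, variance ~1 per feature

print(xn.mean(axis=0))  # close to [0, 0, 0]
print(xn.var(axis=0))   # close to [1, 1, 1]
```

After this step every feature lives on the same scale, regardless of how different the raw columns were.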

Gamma and beta are learned along with the other parameters of the network. There is an update rule for gamma and beta, and it depends on the derivative of the loss function with respect to gamma and beta.
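A minimal sketch of such an update rule, using plain SGD with a made-up learning rate and made-up gradient values (any optimizer that updates the other network parameters would treat gamma and beta the same way):

```python
import numpy as np

lr = 0.01  # illustrative learning rate

gamma = np.ones(3)   # typical initialization: scale = 1
beta = np.zeros(3)   # typical initialization: shift = 0

# suppose the backward pass produced these gradients (made-up values)
dgamma = np.array([0.5, -0.2, 0.1])  # dL/dgamma
dbeta = np.array([0.3, 0.0, -0.4])   # dL/dbeta

# plain gradient-descent step, exactly like any other parameter
gamma -= lr * dgamma
beta -= lr * dbeta
```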

Advantages of Batch Normalization

Here are some benefits of Batch Normalization:

  • The model is less sensitive to hyperparameter tuning. Larger learning rates that previously produced unusable models now train acceptably.
  • Reduces internal covariate shift.
  • Reduces the dependence of gradients on the scale of the parameters or their initial values.
  • Weight initialization becomes somewhat less critical.
  • Dropout can be removed or reduced, since Batch Norm has a regularizing effect of its own.

Batch Normalization Layer

Batch Normalization normalizes batch data (the output of an Affine or Conv layer) before the activation. The Batch Normalization layer consists of a forward pass and a backward pass. The code below is from the book ‘Deep Learning From Scratch’, published in September 2019.

Initialization of Batch Normalization Layer

Temporary variables used in forward() and backward() are defined in the initialization part.

class BatchNormalization:
    def __init__(self, gamma, beta, momentum=0.9, running_mean=None, running_var=None):
        self.gamma = gamma        # scale after normalization
        self.beta = beta          # shift after normalization
        self.momentum = momentum  # smoothing factor of the EMA (weight of the running value)

        # temporary variables used in forward()
        self.input_shape = None   # Conv: 4-dim, Affine: 2-dim
        self.xn = None            # normalized batch data

        # mean and variance for the test session
        self.running_mean = running_mean
        self.running_var = running_var

        # temporary variables used in backward()
        self.batch_size = None
        self.xc = None            # centered batch data
        self.std = None
        self.dgamma = None
        self.dbeta = None

Forward Pass of Batch Normalization Layer

The forward pass has two parts. The second one (__forward) is the actual forward computation of batch normalization.

def forward(self, x, train_flg=True):  # train_flg=False when testing
    self.input_shape = x.shape  # hold the original shape here
    if x.ndim != 2:  # the previous layer is not Affine (ndim=2)
        N, C, H, W = x.shape
        x = x.reshape(N, -1)
    out = self.__forward(x, train_flg)
    return out.reshape(*self.input_shape)  # recover the shape of x after normalization
def __forward(self, x, train_flg):
    if self.running_mean is None:
        N, D = x.shape
        # in the first iteration, initialize the running mean and variance to 0
        self.running_mean = np.zeros(D)
        self.running_var = np.zeros(D)

    if train_flg:  # when training
        mu = x.mean(axis=0)  # mean of each column (feature)
        xc = x - mu  # mean subtraction
        var = np.mean(xc**2, axis=0)
        std = np.sqrt(var + 10e-7)  # + 10e-7 protects against division by zero
        xn = xc / std  # normalize

        self.batch_size = x.shape[0]
        self.xc = xc
        self.xn = xn
        self.std = std
        self.running_mean = self.momentum * self.running_mean + (1 - self.momentum) * mu
        self.running_var = self.momentum * self.running_var + (1 - self.momentum) * var
    else:  # when testing, use the running statistics instead of batch statistics
        xc = x - self.running_mean
        xn = xc / np.sqrt(self.running_var + 10e-7)

    out = self.gamma * xn + self.beta  # gamma and beta are learned during training
    return out
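To illustrate how the running statistics used at test time are maintained, here is a self-contained sketch of the same EMA update applied over simulated mini-batches (the distribution parameters and batch sizes are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
momentum = 0.9
running_mean = np.zeros(3)
running_var = np.zeros(3)

# simulate training mini-batches drawn from a fixed distribution (mean 5, std 2)
for _ in range(200):
    x = rng.normal(loc=5.0, scale=2.0, size=(32, 3))
    mu = x.mean(axis=0)
    var = np.mean((x - mu) ** 2, axis=0)
    # same EMA update as in __forward
    running_mean = momentum * running_mean + (1 - momentum) * mu
    running_var = momentum * running_var + (1 - momentum) * var

# at test time, even a single sample can be normalized with the running statistics
x_test = rng.normal(loc=5.0, scale=2.0, size=(1, 3))
xn = (x_test - running_mean) / np.sqrt(running_var + 1e-7)
print(running_mean)  # close to the true mean, 5.0
print(running_var)   # close to the true variance, 4.0
```

This is why test-time batches (even of size 1) can be normalized consistently: the statistics come from training, not from the test batch itself.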

Backward Pass of Batch Normalization Layer

This process computes dgamma and dbeta, which are used to update gamma and beta, as well as dx, the gradient passed back to the previous layer.

def backward(self, dout):
    if dout.ndim != 2:
        N, C, H, W = dout.shape
        dout = dout.reshape(N, -1)
    dx = self.__backward(dout)
    dx = dx.reshape(*self.input_shape)
    return dx
def __backward(self, dout):
    dbeta = dout.sum(axis=0)
    dgamma = np.sum(self.xn * dout, axis=0)  # Hadamard product, summed per feature
    dxn = self.gamma * dout
    dxc = dxn / self.std
    dstd = -np.sum((dxn * self.xc) / (self.std * self.std), axis=0)
    dvar = 0.5 * dstd / self.std
    dxc += (2.0 / self.batch_size) * self.xc * dvar
    dmu = np.sum(dxc, axis=0)
    dx = dxc - dmu / self.batch_size

    self.dgamma = dgamma
    self.dbeta = dbeta

    return dx
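A standard way to convince yourself that a backward pass like this is correct is a numerical gradient check. The sketch below reimplements the same forward and backward math as standalone functions (so it runs without the class) and compares the analytic dx against central differences on a toy loss L = sum(out**2):

```python
import numpy as np

def bn_forward(x, gamma, beta, eps=1e-7):
    # same computation as __forward in training mode
    mu = x.mean(axis=0)
    xc = x - mu
    var = np.mean(xc ** 2, axis=0)
    std = np.sqrt(var + eps)
    xn = xc / std
    return gamma * xn + beta, (xc, std, xn)

def bn_backward(dout, gamma, cache):
    # same computation as __backward
    xc, std, xn = cache
    N = dout.shape[0]
    dbeta = dout.sum(axis=0)
    dgamma = np.sum(xn * dout, axis=0)
    dxn = gamma * dout
    dxc = dxn / std
    dstd = -np.sum((dxn * xc) / (std * std), axis=0)
    dvar = 0.5 * dstd / std
    dxc += (2.0 / N) * xc * dvar
    dmu = np.sum(dxc, axis=0)
    return dxc - dmu / N, dgamma, dbeta

rng = np.random.default_rng(1)
x = rng.normal(size=(5, 4))
gamma = rng.normal(size=4)
beta = rng.normal(size=4)

# analytic gradient of the toy loss L = sum(out**2): dL/dout = 2 * out
out, cache = bn_forward(x, gamma, beta)
dx, _, _ = bn_backward(2.0 * out, gamma, cache)

# numerical gradient of the same loss by central differences
h = 1e-5
dx_num = np.zeros_like(x)
for i in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[i] += h
    xm[i] -= h
    lp = np.sum(bn_forward(xp, gamma, beta)[0] ** 2)
    lm = np.sum(bn_forward(xm, gamma, beta)[0] ** 2)
    dx_num[i] = (lp - lm) / (2 * h)

print(np.max(np.abs(dx - dx_num)))  # the difference should be very small
```

If the two gradients agree to within numerical precision, the chain of dstd, dvar, dxc, and dmu terms above is implemented correctly.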

Batch Norm is a very useful layer. If you are interested in deep learning, you will certainly need to get familiar with this method. I hope this post gives you a good understanding of how Batch Norm works.