Google ISLR Transformer with W&B (Part 2)

In this article, I’ll be showing you how to create and train a model for the Kaggle ASL (American Sign Language) recognition competition, building on my previous post. I’ve used and modified Mark Wijkhuizen’s code and monitored the training process using W&B, and you can check out the full code and my W&B dashboard in this article. Let’s dive in!

Global Config & Utils

cfg = SimpleNamespace() 
# If True, processing data from scratch 
# If False, loads preprocessed data 
cfg.PREPROCESS_DATA = False  
cfg.TRAIN_MODEL = True 
# True: use 10% of participants as validation set 
# False: use all data for training  
cfg.USE_VAL = False 
 
cfg.N_ROWS = 543 
cfg.N_DIMS = 3 
cfg.DIM_NAMES = ['x', 'y', 'z'] 
cfg.SEED = 42 
cfg.NUM_CLASSES = 250 
 
cfg.IS_INTERACTIVE = True 
cfg.VERBOSE = 2 
cfg.INPUT_SIZE = 64 
 
cfg.BATCH_ALL_SIGNS_N = 4 
cfg.BATCH_SIZE = 256 
cfg.N_EPOCHS = 100 
cfg.LR_MAX = 1e-3 
cfg.N_WARMUP_EPOCHS = 0 
 
# Weight Decay Ratio 
cfg.WD_RATIO = 0.05 
 
#cfg.MASK_VAL = 4237

Here’s an explanation of the parameters in cfg:

  • PREPROCESS_DATA: a boolean value that indicates whether to process data from scratch or to load preprocessed data
  • TRAIN_MODEL: a boolean value that indicates whether to train the model or not
  • USE_VAL: a boolean value that indicates whether to use 10% of participants as a validation set or to use all data for training
  • N_ROWS: the number of landmark rows per frame in the input data (543 landmarks per frame in this dataset)
  • N_DIMS: the number of coordinate dimensions per landmark (x, y, and z)
  • DIM_NAMES: a list of strings that represents the names of the dimensions in the input dataset
  • SEED: the random seed for reproducibility
  • NUM_CLASSES: the number of classes in the output
  • IS_INTERACTIVE: a boolean value that indicates whether to run in interactive mode
  • VERBOSE: the verbosity level (0 = silent, 1 = progress bar, 2 = one line per epoch)
  • INPUT_SIZE: the number of frames fed to the model. A larger input size can provide more information and let the model learn more complex and accurate representations of the input data. However, I hit a SCORING ERROR when I increased it to 128, probably because the larger input pushed the model past the size limit for this competition.
  • BATCH_ALL_SIGNS_N: the number of samples of each sign to include in every training batch (so each batch contains NUM_CLASSES × BATCH_ALL_SIGNS_N = 1,000 samples)
  • BATCH_SIZE: the batch size for training
  • N_EPOCHS: the number of epochs to train
  • LR_MAX: the maximum learning rate for the one-cycle learning rate policy
  • N_WARMUP_EPOCHS: the number of warm-up epochs for the one-cycle learning rate policy
  • WD_RATIO: the weight decay ratio
  • MASK_VAL: a value used for masking, but commented out in this code block
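
Since training is monitored with W&B later in this post, it's convenient to log the whole cfg namespace as the run config. Here's a minimal sketch (my own addition, assuming you're logged in to W&B; the project name is just a placeholder):

import wandb 
 
# Log every hyperparameter in cfg so it shows up on the W&B run page 
# 'gislr-transformer' is a placeholder project name 
run = wandb.init(project='gislr-transformer', config=vars(cfg))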

Model Config

# Epsilon value for layer normalisation 
LAYER_NORM_EPS = 1e-6 
 
# Dense layer units for landmarks 
LIPS_UNITS = 384 
HANDS_UNITS = 384 
POSE_UNITS = 384 
 
# Final embedding and transformer embedding size 
UNITS = 512 
 
# Transformer 
NUM_BLOCKS = 2 
MLP_RATIO = 2 
 
# Dropout 
EMBEDDING_DROPOUT = 0.00 
MLP_DROPOUT_RATIO = 0.30 
CLASSIFIER_DROPOUT_RATIO = 0.10 
 
# Initializers 
INIT_HE_UNIFORM = tf.keras.initializers.he_uniform 
INIT_GLOROT_UNIFORM = tf.keras.initializers.glorot_uniform 
INIT_ZEROS = tf.keras.initializers.constant(0.0) 
 
# Activations 
GELU = tf.keras.activations.gelu 
 
print(f'UNITS: {UNITS}') 
 
''' 
UNITS: 512 
'''

Firstly, we define various hyperparameters and print the value of UNITS. Let’s dive into the hyperparameters used:

  1. LAYER_NORM_EPS: This is a small value added to the denominator in the Layer Normalization operation to avoid division by zero. In this case, the value used is 1e-6.
  2. LIPS_UNITS, HANDS_UNITS, and POSE_UNITS: These are the number of units in the Dense layers used to extract features from the landmarks of the lips, hands, and pose, respectively. The values used are 384 for each.
  3. UNITS: This is the size of the final embedding and the transformer embedding; here, 512, meaning each input token is represented by a vector of size 512. The choice of this hyperparameter affects performance: larger embeddings may capture more complex relationships between input tokens, but may also lead to overfitting or increased computation time.
  4. NUM_BLOCKS: This is the number of Transformer blocks used in the model; here, 2. Increasing the number of blocks allows more complex and nuanced representations of the input data, potentially improving accuracy at the cost of more parameters and compute.
  5. MLP_RATIO: This is the ratio of the number of hidden units in the Feedforward network of the Transformer block to the number of input units. In this case, the value used is 2.
  6. EMBEDDING_DROPOUT: This is the dropout rate applied to the input embeddings. In this case, no dropout is applied.
  7. MLP_DROPOUT_RATIO: This is the dropout rate applied to the output of the Feedforward network in the Transformer block. In this case, a dropout rate of 30% is used.
  8. CLASSIFIER_DROPOUT_RATIO: This is the dropout rate applied to the output of the final classifier. In this case, a dropout rate of 10% is used.
  9. INIT_HE_UNIFORM: This is the initializer used for the kernel weights in the Dense layers. In this case, the He uniform initializer is used.
  10. INIT_GLOROT_UNIFORM: This is the initializer used for the kernel weights in the Transformer blocks. In this case, the Glorot uniform initializer is used.
  11. INIT_ZEROS: This is the initializer used for the bias terms in the Dense layers. In this case, the constant initializer with a value of 0.0 is used.
  12. GELU: This is the activation function used in the Feedforward network of the Transformer block. In this case, the Gaussian Error Linear Units (GELU) activation function is used.
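
As a concrete check on UNITS and MLP_RATIO: with UNITS = 512 and MLP_RATIO = 2, each Transformer block's feedforward network expands the 512-dimensional tokens to 512 × 2 = 1024 hidden units and then projects them back down to 512, which matches the (None, 64, 1024) and (None, 64, 512) dense-layer shapes visible in the model summary further below.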

Transformer

The encoder part of the model consists of N identical layers, each of which can be further divided into the following steps:

  1. Embedding of the one-hot encoded input sample (word embedding)
  2. Addition of position encoding
  3. Multi-head self-attention mechanism
  4. Addition of the input and output of self-attention (residual network structure)
  5. Layer normalization of all time steps
  6. Feedforward neural network structure
  7. Addition of the input and output of the feedforward network (residual network structure)
  8. Layer normalization of all time steps
  9. Repeat steps 3–8 for N layers

The decoder part of the model is also N identical layers, each of which can be further divided into the following steps:

  1. Embedding of the one-hot encoded input sample (word embedding)
  2. Addition of position encoding
  3. Multi-head self-attention mechanism
  4. Addition of the input and output of self-attention (residual network structure)
  5. Layer normalization of all time steps
  6. Multi-head cross-attention mechanism, with the previous decoder layer output as queries and the encoder output as keys and values
  7. Addition of the input and output of cross-attention (residual network structure)
  8. Layer normalization of all time steps
  9. Feedforward neural network structure
  10. Addition of the input and output of the feedforward network (residual network structure)
  11. Layer normalization of all time steps
  12. Repeat steps 3–11 for N layers

The output of the model is generated by passing the final output through a linear layer followed by a softmax activation function for prediction. Note that the ISLR model in this post uses only the encoder stack; the decoder is described above for completeness.

We need to use the scaled dot-product attention between the queries q, keys k, and values v for MultiHeadAttention. The function scales the dot products of q and k, applies an (optionally masked) softmax to obtain attention scores, and uses tf.matmul to compute the weighted sum of the values v. The output tensor has the same shape as the inputs q, k, and v.
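
In equation form, this is the standard scaled dot-product attention:

Attention(Q, K, V) = softmax(Q · Kᵀ / √d_k) · V

where d_k is the per-head depth (d_model / num_of_heads).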

# based on: https://stackoverflow.com/questions/67342988/verifying-the-implementation-of-multihead-attention-in-transformer 
# replaced softmax with softmax layer to support masked softmax 
 
def scaled_dot_product(q, k, v, softmax, attention_mask): 
    # calculates Q . K(transpose) 
    qkt = tf.matmul(q, k, transpose_b=True) 
    # calculates the scaling factor 
    dk = tf.math.sqrt(tf.cast(q.shape[-1], dtype=tf.float32)) 
    scaled_qkt = qkt / dk 
    # masked softmax over the attention scores 
    softmax = softmax(scaled_qkt, mask=attention_mask) 
    # weighted sum of the values 
    z = tf.matmul(softmax, v) 
    # shape: (m, Tx, depth), same shape as q, k, v 
    return z

Next, MultiHeadAttention is implemented as a core component. It takes d_model and num_of_heads as arguments.

class MultiHeadAttention(tf.keras.layers.Layer): 
    def __init__(self,d_model,num_of_heads): 
        super(MultiHeadAttention,self).__init__() 
        self.d_model = d_model 
        self.num_of_heads = num_of_heads 
        self.depth = d_model//num_of_heads 
        self.wq = [tf.keras.layers.Dense(self.depth) for i in range(num_of_heads)] 
        self.wk = [tf.keras.layers.Dense(self.depth) for i in range(num_of_heads)] 
        self.wv = [tf.keras.layers.Dense(self.depth) for i in range(num_of_heads)] 
        self.wo = tf.keras.layers.Dense(d_model) 
        self.softmax = tf.keras.layers.Softmax() 
         
    def call(self,x, attention_mask): 
         
        multi_attn = [] 
        for i in range(self.num_of_heads): 
            Q = self.wq[i](x) 
            K = self.wk[i](x) 
            V = self.wv[i](x) 
            multi_attn.append(scaled_dot_product(Q,K,V, self.softmax, attention_mask)) 
             
        multi_head = tf.concat(multi_attn,axis=-1) 
        multi_head_attention = self.wo(multi_head) 
        return multi_head_attention
# Full Transformer 
class Transformer(tf.keras.Model): 
    def __init__(self, num_blocks): 
        super(Transformer, self).__init__(name='transformer') 
        self.num_blocks = num_blocks 
     
    def build(self, input_shape): 
        self.ln_1s = [] 
        self.mhas = [] 
        self.ln_2s = [] 
        self.mlps = [] 
        # Make Transformer Blocks 
        for i in range(self.num_blocks): 
            # Multi Head Attention 
            self.mhas.append(MultiHeadAttention(UNITS, 8)) 
            # Multi Layer Perceptron 
            self.mlps.append(tf.keras.Sequential([ 
                tf.keras.layers.Dense(UNITS * MLP_RATIO, activation=GELU, kernel_initializer=INIT_GLOROT_UNIFORM), 
                tf.keras.layers.Dropout(MLP_DROPOUT_RATIO), 
                tf.keras.layers.Dense(UNITS, kernel_initializer=INIT_HE_UNIFORM), 
            ])) 
         
    def call(self, x, attention_mask): 
        # Iterate input over transformer blocks 
        for mha, mlp in zip(self.mhas, self.mlps): 
            x = x + mha(x, attention_mask) 
            x = x + mlp(x) 
     
        return x
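
As a quick sanity check (my own sketch, not part of the original notebook), we can push a random batch through a single MultiHeadAttention layer and confirm that the time and channel dimensions are preserved:

# Random batch: (batch, frames, channels); a mask of ones keeps every frame 
x = tf.random.normal([2, cfg.INPUT_SIZE, UNITS]) 
attention_mask = tf.ones([2, cfg.INPUT_SIZE, 1]) 
 
mha = MultiHeadAttention(UNITS, 8) 
print(mha(x, attention_mask).shape)  # (2, 64, 512)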

Embedding

Embedding is a learned representation of a categorical variable. In the Embedding class, LandmarkEmbedding is used to embed each group of landmark data. The outputs of the LandmarkEmbeddings are then combined with a softmax-weighted sum (see landmark_weights below).

Landmark Embedding

The goal of LandmarkEmbedding is to represent landmark data in a continuous vector space, so that it can easily be used as input to downstream machine learning models.

The dense layer learns a representation of the landmark data by mapping it to a continuous vector space. This allows the downstream models to extract relevant features from the landmark data.

It also handles missing landmarks by providing a separate, learned embedding for frames in which a landmark group is absent.

class LandmarkEmbedding(tf.keras.Model): 
    def __init__(self, units, name): 
        super(LandmarkEmbedding, self).__init__(name=f'{name}_embedding') 
        self.units = units 
         
    def build(self, input_shape): 
        # Embedding for missing landmark in frame, initialized with zeros 
        self.empty_embedding = self.add_weight( 
            name=f'{self.name}_empty_embedding', 
            shape=[self.units], 
            initializer=INIT_ZEROS, 
        ) 
        # Embedding 
        self.dense = tf.keras.Sequential([ 
            tf.keras.layers.Dense(self.units, name=f'{self.name}_dense_1', use_bias=False, kernel_initializer=INIT_GLOROT_UNIFORM), 
            tf.keras.layers.Activation(GELU), 
            tf.keras.layers.Dense(self.units, name=f'{self.name}_dense_2', use_bias=False, kernel_initializer=INIT_HE_UNIFORM), 
        ], name=f'{self.name}_dense') 
 
    def call(self, x): 
        return tf.where( 
                # Checks whether landmark is missing in frame 
                tf.reduce_sum(x, axis=2, keepdims=True) == 0, 
                # If so, the empty embedding is used 
                self.empty_embedding, 
                # Otherwise the landmark data is embedded 
                self.dense(x), 
            )
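
To see the missing-landmark logic in action, here is a small sketch (my own addition): frames whose coordinates are all zero receive the learned empty embedding, while everything else goes through the dense stack.

# Dummy lips input in which every frame is 'missing' (all zeros) 
lips_emb = LandmarkEmbedding(LIPS_UNITS, 'lips_demo') 
x = tf.zeros([1, cfg.INPUT_SIZE, 40 * 2]) 
out = lips_emb(x) 
print(out.shape)  # (1, 64, 384) 
# Every frame was empty, so each row equals the (zero-initialized) empty embedding 
print(tf.reduce_all(out == lips_emb.empty_embedding).numpy())  # True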

Embedding

The purpose of Embedding is to embed various types of input data, such as lip landmark coordinates.

The positional embedding layer is initialized with zeros and is used to add positional information to the landmark embeddings. The weights for each type of landmark embedding are initialized with zeros and are learned during training. The fully connected layer is used to combine the embeddings of all landmarks.

class Embedding(tf.keras.Model): 
    def __init__(self): 
        super(Embedding, self).__init__() 
         
    def get_diffs(self, l): 
        S = l.shape[2] 
        other = tf.expand_dims(l, 3) 
        other = tf.repeat(other, S, axis=3) 
        other = tf.transpose(other, [0,1,3,2]) 
        diffs = tf.expand_dims(l, 3) - other 
        diffs = tf.reshape(diffs, [-1, cfg.INPUT_SIZE, S*S]) 
        return diffs 
 
    def build(self, input_shape): 
        # Positional Embedding, initialized with zeros 
        self.positional_embedding = tf.keras.layers.Embedding(cfg.INPUT_SIZE+1, UNITS, embeddings_initializer=INIT_ZEROS) 
        # Embedding layer for Landmarks 
        self.lips_embedding = LandmarkEmbedding(LIPS_UNITS, 'lips') 
        self.left_hand_embedding = LandmarkEmbedding(HANDS_UNITS, 'left_hand') 
        self.pose_embedding = LandmarkEmbedding(POSE_UNITS, 'pose') 
        # Landmark Weights 
        self.landmark_weights = tf.Variable(tf.zeros([3], dtype=tf.float32), name='landmark_weights') 
        # Fully Connected Layers for combined landmarks 
        self.fc = tf.keras.Sequential([ 
            tf.keras.layers.Dense(UNITS, name='fully_connected_1', use_bias=False, kernel_initializer=INIT_GLOROT_UNIFORM), 
            tf.keras.layers.Activation(GELU), 
            tf.keras.layers.Dense(UNITS, name='fully_connected_2', use_bias=False, kernel_initializer=INIT_HE_UNIFORM), 
        ], name='fc') 
 
 
    def call(self, lips0, left_hand0, pose0, non_empty_frame_idxs, training=False): 
        # Lips 
        lips_embedding = self.lips_embedding(lips0) 
        # Left Hand 
        left_hand_embedding = self.left_hand_embedding(left_hand0) 
        # Pose 
        pose_embedding = self.pose_embedding(pose0) 
        # Merge Embeddings of all landmarks with mean pooling 
        x = tf.stack(( 
            lips_embedding, left_hand_embedding, pose_embedding, 
        ), axis=3) 
        x = x * tf.nn.softmax(self.landmark_weights) 
        x = tf.reduce_sum(x, axis=3) 
        # Fully Connected Layers 
        x = self.fc(x) 
        # Add Positional Embedding 
        max_frame_idxs = tf.clip_by_value( 
                tf.reduce_max(non_empty_frame_idxs, axis=1, keepdims=True), 
                1, 
                np.PINF, 
            ) 
        normalised_non_empty_frame_idxs = tf.where( 
            tf.math.equal(non_empty_frame_idxs, -1.0), 
            cfg.INPUT_SIZE, 
            tf.cast( 
                non_empty_frame_idxs / max_frame_idxs * cfg.INPUT_SIZE, 
                tf.int32, 
            ), 
        ) 
        x = x + self.positional_embedding(normalised_non_empty_frame_idxs) 
         
        return x
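
To make the positional-embedding step concrete, here is a worked sketch (my own) of how non-empty frame indices are normalised to the range [0, INPUT_SIZE], with padding (-1) mapped to the extra embedding row at index INPUT_SIZE:

idxs = tf.constant([[0., 5., 10., -1.]])  # last slot is padding 
max_idx = tf.clip_by_value(tf.reduce_max(idxs, axis=1, keepdims=True), 1, np.PINF) 
norm = tf.where( 
    tf.math.equal(idxs, -1.0), 
    cfg.INPUT_SIZE, 
    tf.cast(idxs / max_idx * cfg.INPUT_SIZE, tf.int32), 
) 
print(norm.numpy())  # [[ 0 32 64 64]] with cfg.INPUT_SIZE = 64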

Sparse Categorical Crossentropy With Label Smoothing

Sparse categorical cross-entropy is a loss function used for multi-class classification problems where the labels are integers instead of one-hot encoded vectors.

Label smoothing is a technique used to improve the generalization of a model by preventing it from becoming overconfident in its predictions. It smooths the one-hot encoded target labels by lowering the 1 for the true class to slightly below 1 and raising the 0s for the other classes to small positive values.

# source: https://stackoverflow.com/questions/60689185/label-smoothing-for-sparse-categorical-crossentropy 
def scce_with_ls(y_true, y_pred): 
    # One Hot Encode Sparsely Encoded Target Sign 
    y_true = tf.cast(y_true, tf.int32) 
    y_true = tf.one_hot(y_true, cfg.NUM_CLASSES, axis=1) 
    y_true = tf.squeeze(y_true, axis=2) 
    # Categorical Crossentropy with native label smoothing support 
    return tf.keras.losses.categorical_crossentropy(y_true, y_pred, label_smoothing=0.25)
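
To make this concrete: Keras implements label smoothing as y_smoothed = y · (1 - α) + α / K. With α = 0.25 and K = 250 classes, the target for the true class becomes 0.75 + 0.001 = 0.751 and every other class gets 0.001, rather than a hard 1 and 0.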

Model

The model takes two inputs: “frames” and “non_empty_frame_idxs”. Per sample, “frames” has shape [INPUT_SIZE, N_COLS, N_DIMS] (a 4D tensor once the batch dimension is included), and “non_empty_frame_idxs” is a vector of length INPUT_SIZE.

  • Padding mask: The function first creates a padding mask based on the “non_empty_frame_idxs” input, which is used to mask out any padding frames.
  • Normalizing frames: The function then slices the “frames” input into the body parts used by the model (lips, left hand, and pose), normalizes the data with precomputed mean and standard deviation values, and reshapes the tensors.
  • Embedding: The function then passes the body parts through an “Embedding” layer. This layer learns an embedding representation of the body part data, which is then passed to a Transformer layer with “NUM_BLOCKS” blocks.
  • Transformer, Pooling, and Classification: After the Transformer layer, the output is mean-pooled over the non-padding frames to produce a single vector, which is fed into a final dense layer with “NUM_CLASSES” output neurons.
  • Model, Loss, and Optimizer: The model is compiled using the categorical cross-entropy loss with label smoothing defined above and an AdamW optimizer with weight decay.
  • Metrics: There are three metrics: sparse categorical accuracy, sparse top-5 categorical accuracy, and sparse top-10 categorical accuracy.
def get_model(): 
    # Inputs 
    frames = tf.keras.layers.Input([cfg.INPUT_SIZE, N_COLS, cfg.N_DIMS], dtype=tf.float32, name='frames') 
    non_empty_frame_idxs = tf.keras.layers.Input([cfg.INPUT_SIZE], dtype=tf.float32, name='non_empty_frame_idxs') 
    # Padding Mask 
    mask0 = tf.cast(tf.math.not_equal(non_empty_frame_idxs, -1), tf.float32) 
    mask0 = tf.expand_dims(mask0, axis=2) 
    # Random Frame Masking 
    mask = tf.where( 
        (tf.random.uniform(tf.shape(mask0)) > 0.25) & tf.math.not_equal(mask0, 0.0), 
        1.0, 
        0.0, 
    ) 
    # Fall back to the original mask for samples that are now fully masked 
    mask = tf.where( 
        tf.math.equal(tf.reduce_sum(mask, axis=[1,2], keepdims=True), 0.0), 
        mask0, 
        mask, 
    ) 
     
     
    """ 
        left_hand: 468:489 
        pose: 489:522 
        right_hand: 522:543 
    """ 
    x = frames 
    x = tf.slice(x, [0,0,0,0], [-1,cfg.INPUT_SIZE, N_COLS, 2]) 
    # LIPS 
    lips = tf.slice(x, [0,0,LIPS_START,0], [-1,cfg.INPUT_SIZE, 40, 2]) 
    lips = tf.where( 
            tf.math.equal(lips, 0.0), 
            0.0, 
            (lips - LIPS_MEAN) / LIPS_STD, 
        ) 
    # LEFT HAND 
    left_hand = tf.slice(x, [0,0,40,0], [-1,cfg.INPUT_SIZE, 21, 2]) 
    left_hand = tf.where( 
            tf.math.equal(left_hand, 0.0), 
            0.0, 
            (left_hand - LEFT_HANDS_MEAN) / LEFT_HANDS_STD, 
        ) 
    # POSE 
    pose = tf.slice(x, [0,0,61,0], [-1, cfg.INPUT_SIZE, 5, 2]) 
    pose = tf.where( 
            tf.math.equal(pose, 0.0), 
            0.0, 
            (pose - POSE_MEAN) / POSE_STD, 
        ) 
     
    # Flatten 
    lips = tf.reshape(lips, [-1, cfg.INPUT_SIZE, 40*2]) 
    left_hand = tf.reshape(left_hand, [-1, cfg.INPUT_SIZE, 21*2]) 
    pose = tf.reshape(pose, [-1, cfg.INPUT_SIZE, 5*2]) 
         
    # Embedding 
    x = Embedding()(lips, left_hand, pose, non_empty_frame_idxs) 
     
    # Encoder Transformer Blocks 
    x = Transformer(NUM_BLOCKS)(x, mask) 
     
    # Pooling 
    x = tf.reduce_sum(x * mask, axis=1) / tf.reduce_sum(mask, axis=1) 
    # Classifier Dropout 
    x = tf.keras.layers.Dropout(CLASSIFIER_DROPOUT_RATIO)(x) 
    # Classification Layer 
    x = tf.keras.layers.Dense(cfg.NUM_CLASSES, activation=tf.keras.activations.softmax, kernel_initializer=INIT_GLOROT_UNIFORM)(x) 
     
    outputs = x 
     
    # Create Tensorflow Model 
    model = tf.keras.models.Model(inputs=[frames, non_empty_frame_idxs], outputs=outputs) 
     
    # Sparse Categorical Cross Entropy With Label Smoothing 
    loss = scce_with_ls 
     
    # Adam Optimizer with weight decay 
    optimizer = tfa.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-5, clipnorm=1.0) 
     
    # TopK Metrics 
    metrics = [ 
        tf.keras.metrics.SparseCategoricalAccuracy(name='acc'), 
        tf.keras.metrics.SparseTopKCategoricalAccuracy(k=5, name='top_5_acc'), 
        tf.keras.metrics.SparseTopKCategoricalAccuracy(k=10, name='top_10_acc'), 
    ] 
     
    model.compile(loss=loss, optimizer=optimizer, metrics=metrics) 
     
    return model

And you can print a summary of the model.

tf.keras.backend.clear_session() 
model = get_model() 
 
# Plot model summary 
model.summary(expand_nested=True)
Model: "model" 
__________________________________________________________________________________________________ 
 Layer (type)                   Output Shape         Param #     Connected to                      
================================================================================================== 
 non_empty_frame_idxs (InputLay  [(None, 64)]        0           []                                
 er)                                                                                               
                                                                                                   
 tf.math.not_equal (TFOpLambda)  (None, 64)          0           ['non_empty_frame_idxs[0][0]']    
                                                                                                   
 tf.cast (TFOpLambda)           (None, 64)           0           ['tf.math.not_equal[0][0]']       
                                                                                                   
 tf.expand_dims (TFOpLambda)    (None, 64, 1)        0           ['tf.cast[0][0]']                 
                                                                                                   
 frames (InputLayer)            [(None, 64, 66, 3)]  0           []                                
                                                                                                   
 tf.compat.v1.shape (TFOpLambda  (3,)                0           ['tf.expand_dims[0][0]']          
 )                                                                                                 
                                                                                                   
 tf.slice (TFOpLambda)          (None, 64, 66, 2)    0           ['frames[0][0]']                  
                                                                                                   
 tf.random.uniform (TFOpLambda)  (None, 64, 1)       0           ['tf.compat.v1.shape[0][0]']      
                                                                                                   
 tf.slice_1 (TFOpLambda)        (None, 64, 40, 2)    0           ['tf.slice[0][0]']                
                                                                                                   
 tf.slice_2 (TFOpLambda)        (None, 64, 21, 2)    0           ['tf.slice[0][0]']                
                                                                                                   
 tf.slice_3 (TFOpLambda)        (None, 64, 5, 2)     0           ['tf.slice[0][0]']                
                                                                                                   
 tf.math.greater (TFOpLambda)   (None, 64, 1)        0           ['tf.random.uniform[0][0]']       
                                                                                                   
 tf.math.not_equal_1 (TFOpLambd  (None, 64, 1)       0           ['tf.expand_dims[0][0]']          
 a)                                                                                                
                                                                                                   
 tf.math.subtract (TFOpLambda)  (None, 64, 40, 2)    0           ['tf.slice_1[0][0]']              
                                                                                                   
 tf.math.subtract_1 (TFOpLambda  (None, 64, 21, 2)   0           ['tf.slice_2[0][0]']              
 )                                                                                                 
                                                                                                   
 tf.math.subtract_2 (TFOpLambda  (None, 64, 5, 2)    0           ['tf.slice_3[0][0]']              
 )                                                                                                 
                                                                                                   
 tf.math.logical_and (TFOpLambd  (None, 64, 1)       0           ['tf.math.greater[0][0]',         
 a)                                                               'tf.math.not_equal_1[0][0]']     
                                                                                                   
 tf.math.equal_1 (TFOpLambda)   (None, 64, 40, 2)    0           ['tf.slice_1[0][0]']              
                                                                                                   
 tf.math.truediv (TFOpLambda)   (None, 64, 40, 2)    0           ['tf.math.subtract[0][0]']        
                                                                                                   
 tf.math.equal_2 (TFOpLambda)   (None, 64, 21, 2)    0           ['tf.slice_2[0][0]']              
                                                                                                   
 tf.math.truediv_1 (TFOpLambda)  (None, 64, 21, 2)   0           ['tf.math.subtract_1[0][0]']      
                                                                                                   
 tf.math.equal_3 (TFOpLambda)   (None, 64, 5, 2)     0           ['tf.slice_3[0][0]']              
                                                                                                   
 tf.math.truediv_2 (TFOpLambda)  (None, 64, 5, 2)    0           ['tf.math.subtract_2[0][0]']      
                                                                                                   
 tf.where (TFOpLambda)          (None, 64, 1)        0           ['tf.math.logical_and[0][0]']     
                                                                                                   
 tf.where_2 (TFOpLambda)        (None, 64, 40, 2)    0           ['tf.math.equal_1[0][0]',         
                                                                  'tf.math.truediv[0][0]']         
                                                                                                   
 tf.where_3 (TFOpLambda)        (None, 64, 21, 2)    0           ['tf.math.equal_2[0][0]',         
                                                                  'tf.math.truediv_1[0][0]']       
                                                                                                   
 tf.where_4 (TFOpLambda)        (None, 64, 5, 2)     0           ['tf.math.equal_3[0][0]',         
                                                                  'tf.math.truediv_2[0][0]']       
                                                                                                   
 tf.math.reduce_sum (TFOpLambda  (None, 1, 1)        0           ['tf.where[0][0]']                
 )                                                                                                 
                                                                                                   
 tf.reshape (TFOpLambda)        (None, 64, 80)       0           ['tf.where_2[0][0]']              
                                                                                                   
 tf.reshape_1 (TFOpLambda)      (None, 64, 42)       0           ['tf.where_3[0][0]']              
                                                                                                   
 tf.reshape_2 (TFOpLambda)      (None, 64, 10)       0           ['tf.where_4[0][0]']              
                                                                                                   
 tf.math.equal (TFOpLambda)     (None, 1, 1)         0           ['tf.math.reduce_sum[0][0]']      
                                                                                                   
 embedding (Embedding)          (None, 64, 512)      986243      ['tf.reshape[0][0]',              
                                                                  'tf.reshape_1[0][0]',            
                                                                  'tf.reshape_2[0][0]',            
                                                                  'non_empty_frame_idxs[0][0]']    
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
| embedding (Embedding)        multiple             33280       []                               | 
|                                                                                                | 
| lips_embedding (LandmarkEmbedd  multiple          178560      []                               | 
| ing)                                                                                           | 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|| lips_embedding_dense (Sequenti  (None, 64, 384)  178176     []                               || 
|| al)                                                                                          || 
|||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||| 
||| lips_embedding_dense_1 (Dense)  (None, 64, 384)  30720    []                               ||| 
|||                                                                                            ||| 
||| activation_1 (Activation)  (None, 64, 384)    0           []                               ||| 
|||                                                                                            ||| 
||| lips_embedding_dense_2 (Dense)  (None, 64, 384)  147456   []                               ||| 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
| left_hand_embedding (LandmarkE  multiple          163968      []                               | 
| mbedding)                                                                                      | 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|| left_hand_embedding_dense (Seq  (None, 64, 384)  163584     []                               || 
|| uential)                                                                                     || 
|||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||| 
||| left_hand_embedding_dense_1 (D  (None, 64, 384)  16128    []                               ||| 
||| ense)                                                                                      ||| 
|||                                                                                            ||| 
||| activation_2 (Activation)  (None, 64, 384)    0           []                               ||| 
|||                                                                                            ||| 
||| left_hand_embedding_dense_2 (D  (None, 64, 384)  147456   []                               ||| 
||| ense)                                                                                      ||| 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
| pose_embedding (LandmarkEmbedd  multiple          151680      []                               | 
| ing)                                                                                           | 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|| pose_embedding_dense (Sequenti  (None, 64, 384)  151296     []                               || 
|| al)                                                                                          || 
|||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯||| 
||| pose_embedding_dense_1 (Dense)  (None, 64, 384)  3840     []                               ||| 
|||                                                                                            ||| 
||| activation_3 (Activation)  (None, 64, 384)    0           []                               ||| 
|||                                                                                            ||| 
||| pose_embedding_dense_2 (Dense)  (None, 64, 384)  147456   []                               ||| 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
| fc (Sequential)              (None, 64, 512)      458752      []                               | 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|| fully_connected_1 (Dense)  (None, 64, 512)      196608      []                               || 
||                                                                                              || 
|| activation (Activation)    (None, 64, 512)      0           []                               || 
||                                                                                              || 
|| fully_connected_2 (Dense)  (None, 64, 512)      262144      []                               || 
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
 tf.where_1 (TFOpLambda)        (None, 64, 1)        0           ['tf.math.equal[0][0]',           
                                                                  'tf.expand_dims[0][0]',          
                                                                  'tf.where[0][0]']                
                                                                                                   
 transformer (Transformer)      (None, 64, 512)      4201472     ['embedding[0][0]',               
                                                                  'tf.where_1[0][0]']              
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
| multi_head_attention (MultiHea  multiple          1050624     []                               | 
| dAttention)                                                                                    | 
|                                                                                                | 
| multi_head_attention_1 (MultiH  multiple          1050624     []                               | 
| eadAttention)                                                                                  | 
|                                                                                                | 
| sequential (Sequential)      (None, 64, 512)      1050112     []                               | 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|| dense_25 (Dense)           (None, 64, 1024)     525312      []                               || 
||                                                                                              || 
|| dropout (Dropout)          (None, 64, 1024)     0           []                               || 
||                                                                                              || 
|| dense_26 (Dense)           (None, 64, 512)      524800      []                               || 
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
| sequential_1 (Sequential)    (None, 64, 512)      1050112     []                               | 
||¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯|| 
|| dense_52 (Dense)           (None, 64, 1024)     525312      []                               || 
||                                                                                              || 
|| dropout_1 (Dropout)        (None, 64, 1024)     0           []                               || 
||                                                                                              || 
|| dense_53 (Dense)           (None, 64, 512)      524800      []                               || 
|¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯| 
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
 tf.math.multiply (TFOpLambda)  (None, 64, 512)      0           ['transformer[0][0]',             
                                                                  'tf.where_1[0][0]']              
                                                                                                   
 tf.math.reduce_sum_1 (TFOpLamb  (None, 512)         0           ['tf.math.multiply[0][0]']        
 da)                                                                                               
                                                                                                   
 tf.math.reduce_sum_2 (TFOpLamb  (None, 1)           0           ['tf.where_1[0][0]']              
 da)                                                                                               
                                                                                                   
 tf.math.truediv_3 (TFOpLambda)  (None, 512)         0           ['tf.math.reduce_sum_1[0][0]',    
                                                                  'tf.math.reduce_sum_2[0][0]']    
                                                                                                   
 dropout (Dropout)              (None, 512)          0           ['tf.math.truediv_3[0][0]']       
                                                                                                   
 dense (Dense)                  (None, 250)          128250      ['dropout[0][0]']                 
                                                                                                   
================================================================================================== 
Total params: 5,315,965 
Trainable params: 5,315,965 
Non-trainable params: 0 
__________________________________________________________________________________________________
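
A quick sanity check against the competition's 40 MB model-size limit (see Performance Benchmark below): 5,315,965 float32 parameters × 4 bytes ≈ 21 MB, comfortably within the cap.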

Using the plot_model method, we can visualize the flow of the model.

tf.keras.utils.plot_model(model, show_shapes=True, show_dtype=True, show_layer_names=True, expand_nested=True, show_layer_activations=True)

No NaN Predictions

Before training, the model's predictions are checked for NaN (Not a Number) values.

if not cfg.PREPROCESS_DATA and cfg.TRAIN_MODEL: 
    y_pred = model.predict_on_batch(X_batch).flatten() 
    print(f'# NaN Values In Prediction: {np.isnan(y_pred).sum()}') 
 
 
''' 
# NaN Values In Prediction: 0 
'''

Weight Initialization

We will be plotting a histogram of the softmax output probabilities of the freshly initialized model, along with the mean and standard deviation of the probabilities and a red vertical line indicating the baseline probability of random guessing.

if not cfg.PREPROCESS_DATA and cfg.TRAIN_MODEL: 
    plt.figure(figsize=(12,5)) 
    plt.title(f'Softmax Output Initialized Model | µ={y_pred.mean():.3f}, σ={y_pred.std():.3f}', pad=25) 
    pd.Series(y_pred).plot(kind='hist', bins=128, label='Class Probability') 
    plt.xlim(0, max(y_pred) * 1.1) 
    plt.vlines([1 / cfg.NUM_CLASSES], 0, plt.ylim()[1], color='red', label=f'Random Guessing Baseline 1/NUM_CLASSES={1 / cfg.NUM_CLASSES:.3f}') 
    plt.grid() 
    plt.legend() 
    plt.show()

Learning Rate Scheduler

lrfn takes several arguments: “current_step” is the current step number during training, “num_warmup_steps” is the number of steps used for warmup (gradually increasing the learning rate), “lr_max” is the maximum learning rate, “num_cycles” is the number of cycles in the cosine annealing learning rate schedule (default is 0.50), and “num_training_steps” is the total number of training steps. Note that the warmup branch reads cfg.WARMUP_METHOD ('log' for a logarithmic ramp, otherwise an exponential one); since N_WARMUP_EPOCHS is 0 here, that branch is never taken.

def lrfn(current_step, num_warmup_steps, lr_max, num_cycles=0.50, num_training_steps=cfg.N_EPOCHS): 
     
    if current_step < num_warmup_steps: 
        if cfg.WARMUP_METHOD == 'log': 
            return lr_max * 0.10 ** (num_warmup_steps - current_step) 
        else: 
            return lr_max * 2 ** -(num_warmup_steps - current_step) 
    else: 
        progress = float(current_step - num_warmup_steps) / float(max(1, num_training_steps - num_warmup_steps)) 
 
        return max(0.0, 0.5 * (1.0 + math.cos(math.pi * float(num_cycles) * 2.0 * progress))) * lr_max
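
With no warmup and num_cycles = 0.5, the schedule reduces to a half-cosine decay, lr(epoch) = 0.5 · (1 + cos(π · epoch / N_EPOCHS)) · lr_max. A quick spot check:

print(lrfn(0, num_warmup_steps=0, lr_max=cfg.LR_MAX))   # 0.001  (starts at LR_MAX) 
print(lrfn(50, num_warmup_steps=0, lr_max=cfg.LR_MAX))  # 0.0005 (half of LR_MAX midway) 
print(lrfn(99, num_warmup_steps=0, lr_max=cfg.LR_MAX))  # ~2.5e-07 (nearly zero at the end)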

The plot_lr_schedule function takes in a learning rate schedule and the total number of epochs and plots the learning rate for each epoch.

def plot_lr_schedule(lr_schedule, epochs): 
    fig = plt.figure(figsize=(20, 10)) 
    plt.plot([None] + lr_schedule + [None]) 
    # X Labels 
    x = np.arange(1, epochs + 1) 
    x_axis_labels = [i if epochs <= 40 or i % 5 == 0 or i == 1 else None for i in range(1, epochs + 1)] 
    plt.xlim([1, epochs]) 
    plt.xticks(x, x_axis_labels) # set tick step to 1 and let x axis start at 1 
     
    # Increase y-limit for better readability 
    plt.ylim([0, max(lr_schedule) * 1.1]) 
     
    # Title 
    schedule_info = f'start: {lr_schedule[0]:.1E}, max: {max(lr_schedule):.1E}, final: {lr_schedule[-1]:.1E}' 
    plt.title(f'Step Learning Rate Schedule, {schedule_info}', size=18, pad=12) 
     
    # Plot Learning Rates 
    for x, val in enumerate(lr_schedule): 
        if epochs <= 40 or x % 5 == 0 or x == epochs - 1: 
            if x < len(lr_schedule) - 1: 
                if lr_schedule[x - 1] < val: 
                    ha = 'right' 
                else: 
                    ha = 'left' 
            elif x == 0: 
                ha = 'right' 
            else: 
                ha = 'left' 
            plt.plot(x + 1, val, 'o', color='black') 
            offset_y = (max(lr_schedule) - min(lr_schedule)) * 0.02 
            plt.annotate(f'{val:.1E}', xy=(x + 1, val + offset_y), size=12, ha=ha) 
     
    plt.xlabel('Epoch', size=16, labelpad=5) 
    plt.ylabel('Learning Rate', size=16, labelpad=5) 
    plt.grid() 
    plt.show() 
 
# Learning rate for encoder 
LR_SCHEDULE = [lrfn(step, num_warmup_steps=cfg.N_WARMUP_EPOCHS, lr_max=cfg.LR_MAX, num_cycles=0.50) for step in range(cfg.N_EPOCHS)] 
# Plot Learning Rate Schedule 
plot_lr_schedule(LR_SCHEDULE, epochs=cfg.N_EPOCHS) 
# Learning Rate Callback 
lr_callback = tf.keras.callbacks.LearningRateScheduler(lambda step: LR_SCHEDULE[step], verbose=1)

Weight Decay Callback

WeightDecayCallback is a custom callback class in TensorFlow Keras that updates the weight decay of the optimizer based on the learning rate. on_epoch_begin method is called at the beginning of each epoch and sets the weight decay of the optimizer to the product of the learning rate and the weight decay ratio. It also prints the learning rate and weight decay for monitoring.

# Custom callback to update weight decay with learning rate 
class WeightDecayCallback(tf.keras.callbacks.Callback): 
    def __init__(self, wd_ratio=cfg.WD_RATIO): 
        super(WeightDecayCallback, self).__init__() 
        self.step_counter = 0 
        self.wd_ratio = wd_ratio 
     
    def on_epoch_begin(self, epoch, logs=None): 
        model.optimizer.weight_decay = model.optimizer.learning_rate * self.wd_ratio 
        print(f'learning rate: {model.optimizer.learning_rate.numpy():.2e}, weight decay: {model.optimizer.weight_decay.numpy():.2e}')
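
With cfg.WD_RATIO = 0.05, the weight decay simply tracks the learning rate schedule: at the initial learning rate of 1e-3 the weight decay is 1e-3 × 0.05 = 5e-5, and it decays toward zero together with the cosine schedule.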

Performance Benchmark

The competition requires the model to perform inference with less than 100 milliseconds of latency per video on average and to use less than 40 MB of storage space. Expect approximately 40,000 videos in the test set, with an additional 10-minute buffer allowed for loading the data and miscellaneous overhead.

%%timeit -n 100 
if cfg.TRAIN_MODEL: 
    # Verify model prediction is <<<100ms 
    model.predict_on_batch({ 'frames': X_train[:1], 'non_empty_frame_idxs': NON_EMPTY_FRAME_IDXS_TRAIN[:1] }) 
    pass 
 
''' 
16.8 ms ± 5.61 ms per loop (mean ± std. dev. of 7 runs, 100 loops each) 
'''

Evaluate Initialized Model

We need to verify that the validation dataset covers all signs when USE_VAL is enabled. The DataFrame returned by pd.Series.value_counts().to_frame('Count') has one row per unique value in the y_val array with its count, and .iloc is then used to select specific rows by position.

if cfg.USE_VAL: 
    # Verify Validation Dataset Covers All Signs 
    print(f'# Unique Signs in Validation Set: {pd.Series(y_val).nunique()}') 
    # Value Counts 
    display(pd.Series(y_val).value_counts().to_frame('Count').iloc[[1,2,3,-3,-2,-1]])

The model.evaluate() function returns the loss value and any specified metrics of the model. The * operator unpacks the validation_data tuple into the inputs and labels arguments. The verbose parameter is set to 2, which prints a single summary line instead of a step-by-step progress bar.

# Sanity Check 
if cfg.TRAIN_MODEL and cfg.USE_VAL: 
    _ = model.evaluate(*validation_data, verbose=2)

Train

The training cell creates a fresh model using get_model(), prints a summary of the model's architecture, and trains the model using model.fit(), which takes the batch generator plus the learning rate, weight decay, and W&B Keras callbacks. The training history is stored in history.

if cfg.TRAIN_MODEL: 
    # Clear all models in GPU 
    tf.keras.backend.clear_session() 
 
 
    # Get new fresh model 
    model = get_model() 
     
    # Sanity Check 
    model.summary() 
 
    # Actual Training 
    history = model.fit( 
            x=get_train_batch_all_signs(X_train, y_train, NON_EMPTY_FRAME_IDXS_TRAIN), 
            steps_per_epoch=len(X_train) // (cfg.NUM_CLASSES * cfg.BATCH_ALL_SIGNS_N), 
            epochs=cfg.N_EPOCHS, 
            # Only used for validation data since training data is a generator 
            batch_size=cfg.BATCH_SIZE, 
            validation_data=validation_data, 
            callbacks=[ 
                lr_callback, 
                WeightDecayCallback(), 
                wandb.keras.WandbCallback() 
            ], 
            verbose = cfg.VERBOSE, 
        )
# Save Model Weights 
model.save_weights('model.h5')

Validation predictions are performed when USE_VAL is enabled.

if cfg.USE_VAL: 
    # Validation Predictions 
    y_val_pred = model.predict({ 'frames': X_val, 'non_empty_frame_idxs': NON_EMPTY_FRAME_IDXS_VAL }, verbose=2).argmax(axis=1) 
    # Label 
    labels = [ORD2SIGN.get(i).replace(' ', '_') for i in range(cfg.NUM_CLASSES)]

Landmark Attention Weights

We retrieve landmark_weights from the weights of the model's embedding layer and normalize them with the softmax function from the scipy.special module. The resulting weights for the lips, left hand, and pose indicate how much each landmark group contributes to the recognition model's decision-making process.

# Landmark Weights 
for w in model.get_layer('embedding').weights: 
    if 'landmark_weights' in w.name: 
        weights = scipy.special.softmax(w) 
 
landmarks = ['lips_embedding', 'left_hand_embedding', 'pose_embedding'] 
 
for w, lm in zip(weights, landmarks): 
    print(f'{lm} weight: {(w*100):.1f}%') 
 
 
''' 
lips_embedding weight: 28.5% 
left_hand_embedding weight: 46.8% 
pose_embedding weight: 24.7% 
'''

Classification Report

The classification report uses the sklearn.metrics.classification_report function, which takes the true labels y_val, the predicted labels y_val_pred, the target names (labels) of the classes, and output_dict=True so that the report is returned as a dictionary.

def print_classification_report(): 
    # Classification report for all signs 
    classification_report = sklearn.metrics.classification_report( 
            y_val, 
            y_val_pred, 
            target_names=labels, 
            output_dict=True, 
        ) 
    # Round Data for better readability 
    classification_report = pd.DataFrame(classification_report).T 
    classification_report = classification_report.round(2) 
    classification_report = classification_report.astype({ 
            'support': np.uint16, 
        }) 
    # Add signs 
    classification_report['sign'] = [e if e in SIGN2ORD else -1 for e in classification_report.index] 
    classification_report['sign_ord'] = classification_report['sign'].apply(SIGN2ORD.get).fillna(-1).astype(np.int16) 
    # Sort on F1-score 
    classification_report = pd.concat(( 
        classification_report.head(cfg.NUM_CLASSES).sort_values('f1-score', ascending=False), 
        classification_report.tail(3), 
    )) 
 
    pd.options.display.max_rows = 999 
    display(classification_report) 
 
if cfg.USE_VAL: 
    print_classification_report()

This code resulted in a score of 0.72 on the leaderboard.

I hope this article is helpful for you. Thank you for reading :D