Note: This post is intended as a personal reflection rather than a tutorial.

Relevant Links

Wav2Letter on Facebook Research

Paper: Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Paper: Letter-Based Speech Recognition with Gated ConvNets

Paper: Language Modeling with Gated Convolutional Networks

GitHub PyTorch Implementation

Gated Convolution

The paper Language Modeling with Gated Convolutional Networks has information on implementing gated convolutions using gated linear units (GLUs). They present this equation: \(h_l(X) = (X * W + b) \otimes \sigma(X * V + c)\)

I believe that the desired behaviour of a gated convolutional layer can be achieved in TensorFlow with the code below. I also added batch normalisation because in my experiments the outputs could otherwise sometimes explode to very high values.

import tensorflow as tf


class GatedConvolution(tf.keras.layers.Layer):
  def __init__(self, filters, kernel_size, dropout_rate, padding='causal'):
    super(GatedConvolution, self).__init__()

    # Linear branch of the gated linear unit: X*W + b
    self.convolution = tf.keras.layers.Conv1D(
        filters=filters, kernel_size=kernel_size, padding=padding,
    )

    # Gate branch: X*V + c, passed through a sigmoid in call()
    self.gated = tf.keras.layers.Conv1D(
        filters=filters, kernel_size=kernel_size, padding=padding,
    )

    self.multiply = tf.keras.layers.Multiply()

    # Batch normalisation keeps the gated outputs from exploding during training.
    self.norm = tf.keras.layers.BatchNormalization()

    self.dropout = tf.keras.layers.Dropout(dropout_rate)

  def call(self, x, training):
    convolution_output = self.convolution(x)  # (batch_size, input_seq_len, filters)

    gate_output = tf.keras.activations.sigmoid(self.gated(x))  # (batch_size, input_seq_len, filters)

    # Element-wise product of the two branches: (X*W + b) gated by sigmoid(X*V + c)
    output = self.multiply([convolution_output, gate_output])

    output = self.norm(output, training=training)

    output = self.dropout(output, training=training)

    return output
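
As a quick check that the layer behaves as expected, it can be applied to a random batch. The shapes below are placeholders for a batch of spectrogram frames, not values taken from the papers:

layer = GatedConvolution(filters=100, kernel_size=13, dropout_rate=0.25)

# 8 utterances, 200 frames, 40 features per frame (placeholder shapes).
dummy_batch = tf.random.normal((8, 200, 40))
output = layer(dummy_batch, training=True)

print(output.shape)  # (8, 200, 100): causal padding preserves the time dimension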

Network Structure

The Facebook Research paper presents three different network architectures. I'll start by implementing the one that was designed for the WSJ dataset. The PyTorch implementations are available on their GitHub, and that is where I found this architecture file:

WN 3 C NFEAT 200 13 1 -1
GLU 2
DO 0.25

WN 3 C 100 200 3 1 -1
GLU 2
DO 0.25

WN 3 C 100 200 4 1 -1
GLU 2
DO 0.25

WN 3 C 100 250 5 1 -1
GLU 2
DO 0.25

WN 3 C 125 250 6 1 -1
GLU 2
DO 0.25

WN 3 C 125 300 7 1 -1
GLU 2
DO 0.25

WN 3 C 150 350 8 1 -1
GLU 2
DO 0.25

WN 3 C 175 400 9 1 -1
GLU 2
DO 0.25

WN 3 C 200 450 10 1 -1
GLU 2
DO 0.25

WN 3 C 225 500 11 1 -1
GLU 2
DO 0.25

WN 3 C 250 500 12 1 -1
GLU 2
DO 0.25

WN 3 C 250 500 13 1 -1
GLU 2
DO 0.25

WN 3 C 250 600 14 1 -1
GLU 2
DO 0.25

WN 3 C 300 600 15 1 -1
GLU 2
DO 0.25

WN 3 C 300 750 21 1 -1
GLU 2
DO 0.25

RO 2 0 3 1
WN 0 L 375 1000
GLU 0
DO 0.25

WN 0 L 500 NLABEL

It is difficult to work out exactly what this file is communicating, but I believe that (WN 3 C) represents convolutions, (WN 0 L) represents linear layers, (GLU 2) represents gated linear units and (DO) represents dropout.

The corresponding paper, Letter-Based Speech Recognition with Gated ConvNets, also gives hints as to what the parameters correspond to.
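
To make that reading concrete, here is a small sketch of how one of the convolution lines might be parsed. The field names are my own guesses based on the paper and the architecture file, not something taken from the Facebook code:

def parse_conv_line(line):
  # My guess at the format of e.g. "WN 3 C 100 200 3 1 -1":
  # WN <weight-norm dim> C <in_channels> <out_channels> <kernel_size> <stride> <padding>
  tokens = line.split()
  return {
      'in_channels': None if tokens[3] == 'NFEAT' else int(tokens[3]),  # NFEAT = number of input features
      'out_channels': int(tokens[4]),  # halved by the GLU 2 line that follows
      'kernel_size': int(tokens[5]),
      'stride': int(tokens[6]),
      'padding': int(tokens[7]),
  }

spec = parse_conv_line('WN 3 C 100 200 3 1 -1')
print(spec['out_channels'] // 2)  # 100 channels remain after the gated linear unit

Note that in my GatedConvolution layer above, the two halves of the GLU come from two separate Conv1D layers rather than one wide convolution that is split in two, which should compute the same thing.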

From this information I designed this TensorFlow model:

class GatedConvolutionalEncoder(tf.keras.Model):
  def __init__(self, input_vocab_size):
    super(GatedConvolutionalEncoder, self).__init__()

    # Filter counts and kernel sizes based on my reading of the architecture file above.
    self.convolutions = [
        GatedConvolution(filters=100, kernel_size=3, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=100, kernel_size=4, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=100, kernel_size=5, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=125, kernel_size=6, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=125, kernel_size=7, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=150, kernel_size=8, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=175, kernel_size=9, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=200, kernel_size=10, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=225, kernel_size=11, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=250, kernel_size=12, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=250, kernel_size=13, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=250, kernel_size=14, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=300, kernel_size=15, padding='causal', dropout_rate=0.25),
        GatedConvolution(filters=300, kernel_size=21, padding='causal', dropout_rate=0.25),
    ]

    self.fc1 = tf.keras.layers.Dense(1000)
    # One score per output label for every frame.
    self.final_layer = tf.keras.layers.Dense(input_vocab_size)

  def call(self, inp, training=True):
    output = inp
    for convolution in self.convolutions:
      output = convolution(output, training=training)

    output = self.fc1(output)
    output = self.final_layer(output)
    return output
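
As a sanity check of the overall shapes, the model can be run on a dummy batch. The feature dimension of 40 and the vocabulary size of 30 below are placeholder values, not numbers taken from the papers:

encoder = GatedConvolutionalEncoder(input_vocab_size=30)

# 4 utterances, 500 frames, 40 features per frame (placeholder shapes).
features = tf.random.normal((4, 500, 40))
logits = encoder(features, training=False)

print(logits.shape)  # (4, 500, 30): one score per output label for every frame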

Training

I trained on the JVS Japanese dataset and observed some very strange and unstable training curves.

image1

I thought that perhaps I had implemented the gating incorrectly, so I experimented with removing it and using plain convolutions, but these large value explosions still occurred.

I then tried switching to the JSUT dataset, which is smaller and has a single speaker.

image2

The training curve still exhibited strange behaviour: it would decrease and then increase. This is a normal pattern to see on a validation curve when a model is overfitting, but it seemed very strange for it to happen on a training curve.

I searched for what might cause this and found that other people had seen similar strange increases in their training curves when using the Adam optimiser with a learning rate that was too high. I changed the learning rate from the default of 0.001 to 0.00001 and then trained for 5 epochs on JSUT.
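
For reference, the change amounts to nothing more than constructing the optimiser with a lower learning rate (assuming the standard Keras Adam optimiser):

# Keras Adam defaults to learning_rate=0.001; lowering it by two orders of
# magnitude is what gave me a sensible-looking training curve.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)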

image3

I now have a reasonable training curve shape; however, I feel the loss is still too high and the model is failing at its task of speech recognition. I then trained on the larger JVS dataset for 5 epochs.

image4

The training and validation losses still seem to be decreasing, so I will try training for a bit longer.