The Transformer

... and some popular applications

Recurrent Neural Networks

A class of neural networks which allow previous outputs to be used as inputs while having hidden states

Figure 1. Diagram of an RNN. Image source: Stanford CS230.
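To make the recurrence concrete, here is a minimal sketch of a single RNN step in numpy (dimensions and random weights are assumed purely for illustration, not the CS230 formulation):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: the new hidden state depends on the current
    input x_t and the previous hidden state h_prev."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions, assumed purely for illustration
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(input_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):  # one input vector per time step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)          # hidden state carried forward
```

The loop makes the key limitation visible: each step depends on the previous one, so the sequence cannot be processed in parallel.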

Challenges

Slow to train

  • Backpropagation Through Time (BPTT) is used to compute gradients; it involves unfolding the RNN over time and backpropagating errors through the entire sequence

Figure 2. Unfolding of RNN Through Time. Image source: Wikipedia.

Long term dependency problem

  • As the distance between the current input and the relevant past information increases, gradients tend to diminish exponentially

Figure 3. Vanishing and Exploding Gradients. Image source: SuperAnnotate.
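A standard sketch of why this happens: for a vanilla RNN, the gradient of the loss at step \(t\) with respect to an earlier hidden state \(h_k\) contains a product of Jacobians, one per intervening step,

\[\frac{\partial \mathcal{L}_t}{\partial h_k} \;=\; \frac{\partial \mathcal{L}_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}\]

If the Jacobian norms are bounded by some \(\gamma\), the norm of this product is at most \(\gamma^{\,t-k}\): it vanishes for \(\gamma < 1\) and can explode for \(\gamma > 1\), as the figure above illustrates.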

Inability to Consider Future Input

  • RNNs are designed to model sequential data

  • They process data sequentially, one time step at a time

  • At each step, the network updates its hidden state based on the current input and the previous hidden state

  • For many tasks, future context is crucial; in NLP, the meaning of a word may change depending on the words that follow it

Introducing Transformer

"Attention Is All You Need" [Vaswani et. al.] introduced the Transformer architecture

Developed at Google in 2017

Novelty: eschews recurrence entirely and is built solely on attention mechanisms

Originally designed for machine translation tasks; excellent performance on English \(\rightarrow\) German and English \(\rightarrow\) French [WMT 2014]

Generalizes well to several other tasks

Architecture Overview

Multi-head attention is the heart of this model

Input shape: \( (\text{sequence length}, \; d_\text{model}) \)

Original architecture comprises encoder and decoder

The decoder block introduces masked multi-head attention

Notably, Transformers are highly parallelizable

Figure 4. The full transformer model. Image source: "Attention Is All You Need" [Vaswani et al.]

Task: Text Generation

We demonstrate how a Transformer works using the example of a text generation (next-word prediction) task. Important terms:

  1. Vocabulary: the set of all possible tokens recognized by the model

  2. Embeddings: the words, represented as vectors

  3. Self-Attention: captures relationships between words in the sentence

  4. Multi-Head Attention: running self-attention several times in parallel

Embeddings

Words are commonly represented as vectors in a high dimensional space to capture their relationships and semantics

Figure 5. Example of word embeddings in a 3-D space. Image source: Baeldung.

Input Embeddings

The sentence is first tokenized. Note: Tokens may not be full words!

Figure 6. Representation of Embeddings. Image source: own work.
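A minimal sketch of how tokens become embedding vectors, assuming a toy word-level vocabulary and a randomly initialized embedding table (real models use learned subword tokenizers and learned embeddings):

```python
import numpy as np

# Toy vocabulary and embedding table, assumed for illustration only
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
token_ids = [vocab[t] for t in tokens]   # tokenization -> integer ids
E = embedding_table[token_ids]           # one embedding row per token
print(E.shape)                           # (6, 8) = (seq_len, d_model)
```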

Positional Embeddings

For each word, information about its position in the sentence is encoded in the embeddings

This is to ensure the model treats words that are:

  • close to each other as "close"

  • distant from each other as "distant"

Positional encoding is done with the goal of representing a pattern that can be learned by the model

Figure 7. Representation of Positional Embeddings. Image source: own work.
Figure 8. Calculation of positional encoding. Image source: own work.
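One common choice, and the one used in the original paper, is the sinusoidal scheme \(PE(pos, 2i) = \sin\big(pos / 10000^{2i/d_\text{model}}\big)\), \(PE(pos, 2i+1) = \cos\big(pos / 10000^{2i/d_\text{model}}\big)\). A numpy sketch of that calculation (assuming an even \(d_\text{model}\)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

# The encoding is simply added to the input embeddings E of shape (seq_len, d_model):
# E = E + positional_encoding(E.shape[0], E.shape[1])
```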

Self-Attention

Self-attention allows the model to relate words to each other

\[\text{Attention}(Q, K, V) = \text{softmax} \bigg( \frac{QK^T}{\sqrt{d_k}} \bigg) V\]
  • \(Q \rightarrow\) queries

  • \(K \rightarrow\) keys

  • \(d_k \rightarrow\) dimension of the key vectors (equal to the dimension of the query vectors)

  • \(V \rightarrow\) values
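This formula translates almost directly into code. A minimal numpy sketch for a single attention head (shapes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1) # each row sums to 1
    return weights @ V                 # (seq_len, d_v)
```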

Queries? Keys?!

Intuitively, an attention mechanism computes the relative importance of the inputs in the sequence (called keys) for a particular output (called query)

Keys are simply vectors computed from the input sequence, and queries are vectors computed from the output sequence

For the text generation task, the input and output are the same sequence!

Queries and keys are calculated by multiplying weight matrices \(W^Q\) and \(W^K\) with each embedding \(\vec{E}_i \; (i = 1, \dots, \text{seq})\).

Practically, the individual query vectors are stacked into a single matrix \(Q\); similarly, the key vectors are stacked into \(K\).
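Concretely, continuing the sketch above (random matrices stand in for the learned weights; the value projection \(W^V\), introduced just below, is included for completeness):

```python
import numpy as np

seq_len, d_model, d_k = 6, 8, 4            # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(seq_len, d_model))    # embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = E @ W_Q                                # rows are the query vectors
K = E @ W_K                                # rows are the key vectors
V = E @ W_V                                # rows are the value vectors

out = attention(Q, K, V)                   # reuses the attention() sketch above
```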

Computing Self-Attention

Prerequisite: matrices \(Q\), \(K\) and \(V\) must be computed

As previously mentioned, these matrices are simply a convenient way to group together multiple vectors

We start by computing \(Q K^T\)

Figure 9. Representation of \(Q K^T\) multiplication. Image source: own work.
Figure 10. The \(QK^T\) matrix. Image source: own work.
Figure 11. The \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big)\) matrix. Image source: own work.

Now we obtain \(V\) in a similar fashion by multiplying a matrix \(W^V\) with each of the embedding vectors

Finally, we multiply \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big)\) with \(V\)

Figure 12. Representation of \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big) V\). Image source: own work.

Properties of Self-Attention

  • No inherent notion of order: reordering the inputs simply reorders the outputs of self-attention; position information survives only because it was already added to the embeddings!

  • Values along the diagonal are expected to be highest, since each token tends to be most similar to itself

Figure 13. The \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big)\) matrix with diagonals marked. Image source: own work.
  • To inhibit interactions at some positions, the corresponding entries of \(QK^T\) can be set to \(-\infty\) before applying softmax (a code sketch follows the figures below).

Figure 14. Before applying softmax
Figure 15. After applying softmax
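A sketch of this masking for the autoregressive (decoder) case, where position \(i\) must not attend to later positions \(j > i\); it reuses the softmax and attention shapes from the earlier sketch, and relies on \(e^{-\infty} = 0\) so masked positions receive weight exactly zero:

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: -inf above the diagonal, 0 elsewhere."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = softmax(scores, axis=-1)   # masked entries get weight 0 after softmax
    return weights @ V
```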

Layer Normalization

Figure 16. Common Normalization Techniques. Image source: "Layer Normalization" [Lei Ba et al.]
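Layer normalization normalizes each token's feature vector to zero mean and unit variance, then applies a learned scale and shift. A minimal sketch (gamma and beta stand in for the learned parameters):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature dimension of each token independently.
    x: (seq_len, d_model); gamma, beta: (d_model,)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```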

Multi-Head Attention

So far, only a single head of attention has been discussed.

Multi-head attention comprises \(h\) attention heads stacked in parallel.

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O\]
\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
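A sketch of these two equations, reusing the attention() function from the earlier sketch; all weight matrices are random stand-ins for learned parameters, and the head size \(d_k = d_\text{model}/h\) follows the original paper:

```python
import numpy as np

def multi_head_attention(Q, K, V, weights_per_head, W_O):
    """weights_per_head: one (W_Q_i, W_K_i, W_V_i) tuple per head."""
    heads = [attention(Q @ W_Q_i, K @ W_K_i, V @ W_V_i)
             for (W_Q_i, W_K_i, W_V_i) in weights_per_head]
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1, ..., head_h) W^O

# Toy setup: h = 2 heads, d_model = 8, d_k = 4 (assumed sizes)
rng = np.random.default_rng(0)
h, d_model, d_k = 2, 8, 4
X = rng.normal(size=(6, d_model))                 # self-attention: Q = K = V = X
weights_per_head = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
                    for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, weights_per_head, W_O)   # (6, d_model)
```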

Training

Training occurs entirely in a single step! The whole target sequence is fed in at once (teacher forcing), and the causal mask prevents each position from attending to future tokens.

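A schematic sketch of how the parallel training targets are formed (toy token ids; the model call and loss are indicated as comments, since they depend on the pieces sketched earlier):

```python
import numpy as np

# Hypothetical token ids for a training sentence, assumed for illustration
sequence = np.array([0, 1, 2, 3, 4, 1, 5])

decoder_input = sequence[:-1]   # everything except the last token
labels        = sequence[1:]    # the same sequence shifted by one position

# One masked forward pass produces logits for *every* position at once,
# so the loss over all positions is computed in a single step:
# logits = model(decoder_input)          # (seq_len, vocab_size), hypothetical
# loss = cross_entropy(logits, labels)   # averaged over all positions
```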

Inference

Inference, in contrast, proceeds autoregressively: at time step 1 the decoder receives only the start token and predicts the first output token; at each subsequent time step (2, 3, 4, ...) the tokens generated so far are fed back in to predict the next one, until an end-of-sequence token is produced.
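A sketch of this loop with a hypothetical `model` function that returns next-token logits (greedy decoding; real systems often use beam search or sampling instead):

```python
import numpy as np

def generate(model, start_id, eos_id, max_len=20):
    """Greedy autoregressive decoding: feed back everything generated so far."""
    tokens = [start_id]
    for _ in range(max_len):
        logits = model(np.array(tokens))      # hypothetical: (len(tokens), vocab_size)
        next_id = int(np.argmax(logits[-1]))  # most likely next token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```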

Final Thoughts

Figure 17. The full transformer model. Image source: "Attention Is All You Need" [Vaswani et al.]

Whisper

Whisper is an automatic speech recognition (ASR) system developed by OpenAI

Trained on 680,000 hours of multilingual and multitask supervised data collected from the web

Aims to provide "human level robustness and accuracy" on English speech recognition
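For reference, transcription with the open-source `openai-whisper` package looks roughly like this (a sketch based on the package's published quickstart; the model name and file path are placeholders):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")       # other sizes: tiny, small, medium, large
result = model.transcribe("audio.mp3")   # placeholder path to any audio file
print(result["text"])                    # the recognized transcript
```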

Motivation

Prior speech recognition models required extensive dataset-specific fine-tuning

This resulted in a lack of generalizability across datasets

The aim was to design a model that relies on large-scale supervised pre-training on diverse datasets rather than dataset-specific fine-tuning

Architecture

Figure 18. Architecture of Whisper. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

Some key points:

  • Encoder-decoder architecture

  • Audio broken into 30 second chunks

  • Resampled to 16,000 Hz

  • 80-channel log-magnitude Mel spectrogram computed on 25-millisecond windows with a stride of 10 milliseconds
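A sketch of this preprocessing using librosa, with parameter values taken from the bullets above (16 kHz sampling, 25 ms windows = 400 samples, 10 ms stride = 160 samples, 80 Mel channels); Whisper's own preprocessing code may differ in detail:

```python
import numpy as np
import librosa  # pip install librosa

# Load and resample the audio to 16 kHz (file path is a placeholder)
audio, sr = librosa.load("speech.wav", sr=16000)

# 80-channel Mel spectrogram: 25 ms windows (400 samples), 10 ms stride (160 samples)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80
)

# Log-magnitude compression
log_mel = np.log10(np.maximum(mel, 1e-10))
print(log_mel.shape)   # (80, number_of_frames)
```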

Mel Spectrogram

Figure 19. A Mel Spectrogram. Image source: "Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset" [de Benito et al.]

Multitask Format

  • Transcription is just one part of the overall speech recognition problem

  • e.g., voice activity detection, speaker diarization, and inverse text normalization

  • Handling these separately makes the system complex

  • Whisper attempts to perform the entire speech recognition pipeline using a single model

Sequence to Sequence Learning

Figure 20. Sequence to Sequence Learning. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

Multitask Training Format

Figure 21. Multitask Training Format. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

Evaluation Metrics

Figure 22. Correlation of pre-training supervision amount with model performance. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

References

[1] A. Vaswani et al., ‘Attention is all you need’, in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California, USA, 2017, pp. 6000–6010.
[2] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, ‘Robust Speech Recognition via Large-Scale Weak Supervision’, in International Conference on Machine Learning, 2022.

Thank You!