The Transformer

... and some popular applications

Recurrent Neural Networks

A class of neural networks which allow previous outputs to be used as inputs while having hidden states

Figure 1. Diagram of an RNN. Image source: Stanford CS230.
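To make the recurrence concrete, here is a minimal sketch of a single RNN step in numpy (dimensions and random weights are assumed purely for illustration, not the CS230 formulation):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step: the new hidden state depends on the current
    input x_t and the previous hidden state h_prev."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions, assumed purely for illustration
input_dim, hidden_dim, seq_len = 8, 16, 5
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(input_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(seq_len, input_dim)):  # one input vector per time step
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)          # hidden state carried forward
```

The loop makes the key limitation visible: each step depends on the previous one, so the sequence cannot be processed in parallel.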

Challenges

Slow to train

  • Backpropagation Through Time (BPTT) is used to compute gradients; it involves unfolding the RNN over time and backpropagating errors through the entire sequence

Figure 2. Unfolding of RNN Through Time. Image source: Wikipedia.

Long term dependency problem

  • As the distance between the current input and the relevant past information increases, gradients tend to diminish exponentially

Figure 3. Vanishing and Exploding Gradients. Image source: SuperAnnotate.
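A standard sketch of why this happens: for a vanilla RNN, the gradient of the loss at step \(t\) with respect to an earlier hidden state \(h_k\) contains a product of Jacobians, one per intervening step,

\[\frac{\partial \mathcal{L}_t}{\partial h_k} \;=\; \frac{\partial \mathcal{L}_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}\]

If the Jacobian norms are bounded by some \(\gamma\), the norm of this product is at most \(\gamma^{\,t-k}\): it vanishes for \(\gamma < 1\) and can explode for \(\gamma > 1\), as the figure above illustrates.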

Inability to Consider Future Input

  • RNNs are designed to model sequential data

  • They process data sequentially, one time step at a time

  • At each step, the network updates its hidden state based on the current input and the previous hidden state

  • For many tasks, future context is crucial; in NLP, the meaning of a word may change depending on the words that follow it

Introducing Transformer

"Attention Is All You Need" [Vaswani et. al.] introduced the Transformer architecture

Developed at Google in 2017

Novelty: eschews recurrence entirely and is built solely on attention mechanisms

Originally designed for machine translation tasks; excellent performance on English \(\rightarrow\) German and English \(\rightarrow\) French [WMT 2014]

Generalizes well to several other tasks

Architecture Overview

Multi-head attention is the heart of this model

Input shape: \( (\text{sequence length}, \; d_\text{model}) \)

Original architecture comprises encoder and decoder

The decoder block introduces masked multi-head attention

Notably, Transformers are highly parallelizable

Figure 4. The full transformer model. Image source: "Attention Is All You Need" [Vaswani et al.]

Task: Text Generation

We demonstrate how a Transformer works using the example of a text generation (next-word prediction) task. Important terms:

  1. Vocabulary: the set of all possible tokens recognized by the model

  2. Embeddings: the words, represented as vectors

  3. Self-Attention: captures relationships between words in the sentence

  4. Multi-Head Attention: running self-attention several times in parallel

Embeddings

Words are commonly represented as vectors in a high dimensional space to capture their relationships and semantics

Figure 5. Example of word embeddings in a 3-D space. Image source: Baeldung.

Input Embeddings

The sentence is first tokenized. Note: Tokens may not be full words!

Figure 6. Representation of Embeddings. Image source: own work.
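A minimal sketch of how tokens become embedding vectors, assuming a toy word-level vocabulary and a randomly initialized embedding table (real models use learned subword tokenizers and learned embeddings):

```python
import numpy as np

# Toy vocabulary and embedding table, assumed for illustration only
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

tokens = ["the", "cat", "sat", "on", "the", "mat"]
token_ids = [vocab[t] for t in tokens]   # tokenization -> integer ids
E = embedding_table[token_ids]           # one embedding row per token
print(E.shape)                           # (6, 8) = (seq_len, d_model)
```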

Positional Embeddings

For each word, information about its position in the sentence is encoded in the embeddings

This is to ensure the model treats words that are:

  • close to each other as "close"

  • distant from each other as "distant"

Positional encoding is done with the goal of representing a pattern that can be learned by the model

Figure 7. Representation of Positional Embeddings. Image source: own work.
Figure 8. Calculation of positional encoding. Image source: own work.
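One common choice, and the one used in the original paper, is the sinusoidal scheme \(PE(pos, 2i) = \sin\big(pos / 10000^{2i/d_\text{model}}\big)\), \(PE(pos, 2i+1) = \cos\big(pos / 10000^{2i/d_\text{model}}\big)\). A numpy sketch of that calculation (assuming an even \(d_\text{model}\)):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]          # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions
    return pe

# The encoding is simply added to the input embeddings E of shape (seq_len, d_model):
# E = E + positional_encoding(E.shape[0], E.shape[1])
```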

Self-Attention

Self-attention allows the model to relate words to each other

\[\text{Attention}(Q, K, V) = \text{softmax} \bigg( \frac{QK^T}{\sqrt{d_k}} \bigg) V\]
  • \(Q \rightarrow\) queries

  • \(K \rightarrow\) keys

  • \(d_k \rightarrow\) dimension of the key vectors (equal to the dimension of the query vectors)

  • \(V \rightarrow\) values
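This formula translates almost directly into code. A minimal numpy sketch for a single attention head (shapes are assumptions for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len) similarity matrix
    weights = softmax(scores, axis=-1) # each row sums to 1
    return weights @ V                 # (seq_len, d_v)
```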

Queries? Keys?!

Intuitively, an attention mechanism computes the relative importance of the inputs in the sequence (called keys) for a particular output (called query)

Keys are simply vectors computed from the input sequence, and queries are vectors computed from the output sequence

For the text generation task, the input and output are the same sequence!

Queries and keys are calculated by multiplying weight matrices \(W^Q\) and \(W^K\) with each embedding \(\vec{E}_i \; (i = 1, \dots, \text{seq})\).

Practically, the individual query vectors are stacked into a single matrix \(Q\); similarly, the key vectors are stacked into \(K\).
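Concretely, continuing the sketch above (random matrices stand in for the learned weights; the value projection \(W^V\), introduced just below, is included for completeness):

```python
import numpy as np

seq_len, d_model, d_k = 6, 8, 4            # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(seq_len, d_model))    # embeddings, one row per token

W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = E @ W_Q                                # rows are the query vectors
K = E @ W_K                                # rows are the key vectors
V = E @ W_V                                # rows are the value vectors

out = attention(Q, K, V)                   # reuses the attention() sketch above
```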

Computing Self-Attention

Prerequisite: matrices \(Q\), \(K\) and \(V\) must be computed

As previously mentioned, these matrices are simply a convenient way to group together multiple vectors

We start by computing \(Q K^T\)

Figure 9. Representation of \(Q K^T\) multiplication. Image source: own work.
Figure 10. The \(QK^T\) matrix. Image source: own work.
Figure 11. The \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big)\) matrix. Image source: own work.

Now we obtain \(V\) in a similar fashion by multiplying a matrix \(W^V\) with each of the embedding vectors

Finally, we multiply \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big)\) with \(V\)

Figure 12. Representation of \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big) V\). Image source: own work.

Properties of Self-Attention

  • No inherent notion of order: reordering the inputs simply reorders the outputs of self-attention; position information survives only because it was already added to the embeddings!

  • Values along the diagonal are expected to be highest, since each token tends to be most similar to itself

Figure 13. The \(\text{softmax}\Big( \frac{QK^T}{\sqrt{d_k}} \Big)\) matrix with diagonals marked. Image source: own work.
  • To inhibit interactions at some positions, the corresponding entries of \(QK^T\) can be set to \(-\infty\) before applying softmax (a code sketch follows the figures below).

Figure 14. Before applying softmax
Figure 15. After applying softmax
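A sketch of this masking for the autoregressive (decoder) case, where position \(i\) must not attend to later positions \(j > i\); it reuses the softmax and attention shapes from the earlier sketch, and relies on \(e^{-\infty} = 0\) so masked positions receive weight exactly zero:

```python
import numpy as np

def causal_mask(seq_len):
    """Additive mask: -inf above the diagonal, 0 elsewhere."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(upper == 1, -np.inf, 0.0)

def masked_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + causal_mask(Q.shape[0])
    weights = softmax(scores, axis=-1)   # masked entries get weight 0 after softmax
    return weights @ V
```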

Layer Normalization

Figure 16. Common Normalization Techniques. Image source: "Layer Normalization" [Lei Ba et al.]
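Layer normalization normalizes each token's feature vector to zero mean and unit variance, then applies a learned scale and shift. A minimal sketch (gamma and beta stand in for the learned parameters):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the feature dimension of each token independently.
    x: (seq_len, d_model); gamma, beta: (d_model,)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```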

Multi-Head Attention

So far, only a single head of attention has been discussed.

Multi-head attention comprises \(h\) attention heads stacked in parallel.

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) W^O\]
\[\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\]
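A sketch of these two equations, reusing the attention() function from the earlier sketch; all weight matrices are random stand-ins for learned parameters, and the head size \(d_k = d_\text{model}/h\) follows the original paper:

```python
import numpy as np

def multi_head_attention(Q, K, V, weights_per_head, W_O):
    """weights_per_head: one (W_Q_i, W_K_i, W_V_i) tuple per head."""
    heads = [attention(Q @ W_Q_i, K @ W_K_i, V @ W_V_i)
             for (W_Q_i, W_K_i, W_V_i) in weights_per_head]
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1, ..., head_h) W^O

# Toy setup: h = 2 heads, d_model = 8, d_k = 4 (assumed sizes)
rng = np.random.default_rng(0)
h, d_model, d_k = 2, 8, 4
X = rng.normal(size=(6, d_model))                 # self-attention: Q = K = V = X
weights_per_head = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
                    for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, X, X, weights_per_head, W_O)   # (6, d_model)
```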

Training

Training occurs entirely in a single step! The whole target sequence is fed in at once (teacher forcing), and the causal mask prevents each position from attending to future tokens.

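A schematic sketch of how the parallel training targets are formed (toy token ids; the model call and loss are indicated as comments, since they depend on the pieces sketched earlier):

```python
import numpy as np

# Hypothetical token ids for a training sentence, assumed for illustration
sequence = np.array([0, 1, 2, 3, 4, 1, 5])

decoder_input = sequence[:-1]   # everything except the last token
labels        = sequence[1:]    # the same sequence shifted by one position

# One masked forward pass produces logits for *every* position at once,
# so the loss over all positions is computed in a single step:
# logits = model(decoder_input)          # (seq_len, vocab_size), hypothetical
# loss = cross_entropy(logits, labels)   # averaged over all positions
```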

Inference

Inference, in contrast, proceeds autoregressively: at time step 1 the decoder receives only the start token and predicts the first output token; at each subsequent time step (2, 3, 4, ...) the tokens generated so far are fed back in to predict the next one, until an end-of-sequence token is produced.
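A sketch of this loop with a hypothetical `model` function that returns next-token logits (greedy decoding; real systems often use beam search or sampling instead):

```python
import numpy as np

def generate(model, start_id, eos_id, max_len=20):
    """Greedy autoregressive decoding: feed back everything generated so far."""
    tokens = [start_id]
    for _ in range(max_len):
        logits = model(np.array(tokens))      # hypothetical: (len(tokens), vocab_size)
        next_id = int(np.argmax(logits[-1]))  # most likely next token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```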

Final Thoughts

Figure 17. The full transformer model. Image source: "Attention Is All You Need" [Vaswani et al.]

Whisper

Whisper is an automatic speech recognition (ASR) system developed by OpenAI

Trained on 680,000 hours of multilingual and multitask supervised data collected from the web

Aims to provide "human level robustness and accuracy" on English speech recognition
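For reference, transcription with the open-source `openai-whisper` package looks roughly like this (a sketch based on the package's published quickstart; the model name and file path are placeholders):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")       # other sizes: tiny, small, medium, large
result = model.transcribe("audio.mp3")   # placeholder path to any audio file
print(result["text"])                    # the recognized transcript
```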

Motivation

Prior speech recognition models required extensive dataset-specific fine-tuning

This resulted in a lack of generalizability across datasets

The aim was to design a model that relies on large-scale supervised pre-training on diverse datasets rather than dataset-specific fine-tuning

Architecture

Figure 18. Architecture of Whisper. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

Some key points:

  • Encoder-decoder architecture

  • Audio broken into 30 second chunks

  • Resampled to 16,000 Hz

  • 80-channel log-magnitude Mel spectrogram computed on 25-millisecond windows with a stride of 10 milliseconds
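A sketch of this preprocessing using librosa, with parameter values taken from the bullets above (16 kHz sampling, 25 ms windows = 400 samples, 10 ms stride = 160 samples, 80 Mel channels); Whisper's own preprocessing code may differ in detail:

```python
import numpy as np
import librosa  # pip install librosa

# Load and resample the audio to 16 kHz (file path is a placeholder)
audio, sr = librosa.load("speech.wav", sr=16000)

# 80-channel Mel spectrogram: 25 ms windows (400 samples), 10 ms stride (160 samples)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80
)

# Log-magnitude compression
log_mel = np.log10(np.maximum(mel, 1e-10))
print(log_mel.shape)   # (80, number_of_frames)
```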

Mel Spectrogram

Figure 19. A Mel Spectrogram. Image source: "Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset" [de Benito et al.]

Multitask Format

  • Transcription is just one part of the overall speech recognition problem

  • e.g., voice activity detection, speaker diarization, and inverse text normalization

  • Handling these separately makes the system complex

  • Whisper attempts to perform the entire speech recognition pipeline using a single model

Sequence to Sequence Learning

Figure 20. Sequence to Sequence Learning. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

Multitask Training Format

Figure 21. Multitask Training Format. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

Evaluation Metrics

Figure 22. Correlation of pre-training supervision amount with model performance. Image source: "Robust Speech Recognition via Large-Scale Weak Supervision"

References

[1] A. Vaswani et al., ‘Attention is all you need’, in Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California, USA, 2017, pp. 6000–6010.
[2] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, ‘Robust Speech Recognition via Large-Scale Weak Supervision’, in International Conference on Machine Learning, 2022.

Thank You!