Intuitively, an attention mechanism computes the relative importance of each input in the sequence (the keys) for a particular output (the query).
Queries and keys are simply vectors: queries are computed from the output sequence and keys from the input sequence.
For a text-generation task, the input and output are the same sequence (self-attention)!
Queries and keys are calculated by multiplying weight matrices \(W^Q\) and \(W^K\) with each embedding \(\vec{E_i} \; (i = 1..\text{seq})\).
In practice, the individual query vectors are stacked into a single matrix, \(Q\). Similarly, the key vectors are stacked into \(K\).
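The projections above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full attention layer: the dimensions, the random initialization, and the scaled dot-product scoring at the end are all illustrative assumptions.

```python
import numpy as np

# Hypothetical sizes: sequence length, embedding size, query/key size.
seq_len, d_model, d_k = 4, 8, 8

rng = np.random.default_rng(0)
E = rng.standard_normal((seq_len, d_model))   # embeddings E_i, stacked row-wise
W_Q = rng.standard_normal((d_model, d_k))     # learned query projection W^Q
W_K = rng.standard_normal((d_model, d_k))     # learned key projection W^K

# Each row of Q is W^Q applied to one embedding; likewise for K.
Q = E @ W_Q
K = E @ W_K

# Relative importance of each key for each query:
# scaled dot-product scores, normalized with a softmax over the keys.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
```

Each row of `weights` then sums to 1 and gives one output position's attention distribution over the inputs.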