Background
Existing RNNs could handle sequence data, but dealing with sequence data that has missing elements (as shown above) was very difficult.
The Transformer was introduced to address this.
Transformer
No recurrent structure like RNNs.
The Transformer is the first sequence transduction model based entirely on attention.

It was originally a model for machine translation. But since the Transformer is a methodology for processing sequential data and encoding it, it can be used beyond translation.
Recently, Transformer and self-attention are used in virtually every field.

The Transformer is a sequence-to-sequence model as shown above. Let’s look more closely.

Unlike RNNs, there’s no recurrence. If 3 words were input to an RNN, it would recur 3 times to produce output.
But the Transformer produces encoding vectors all at once in a single encoding pass, regardless of whether there are 3 or 100 words. The decoder, however, still generates its output autoregressively.
Key concept of transformer
- How are n words processed at once during encoding?
- What information flows between encoder and decoder?
- How does the decoder generate output?
Encoder
Takes all vectors as input. Self-attention plays a key role in both encoder and decoder. The feed-forward NN that follows is the familiar MLP.

- Self-attention takes n vectors.
- To transform an input vector x_i into z_i, all n vectors are used.
- The paths for creating the z vectors are all interdependent.
- When z_i passes through the feed-forward NN, each vector is processed in parallel, independently.
Self-attention
To analyze the sentence below, a network of dependencies like this is built:
The animal didn’t cross the street because it was too tired.

Humans naturally understand “it” refers to “animal.” When learned through self-attention, it shows strong dependency near “animal” as shown.

Query, Key, Value vectors are computed per word (= embedding). One embedding produces one query, one key, and one value.
Encoder computation

From the lecture, explaining this in words is really hard, but the math is simple.
In words:
- score = inner product of the query and key vectors
- d_k = key vector dimension
- softmax result = softmax applied to the score divided by sqrt(d_k)
- sum = softmax result x value
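The verbal steps above can be sketched for a single query (a minimal NumPy sketch; the example vectors are made up for illustration):

```python
import numpy as np

def attention_for_one_query(q, keys, values):
    """Scaled dot-product attention for one word's query.

    q:      (d_k,)   query vector of one word
    keys:   (n, d_k) key vectors of all n words
    values: (n, d_v) value vectors of all n words
    """
    d_k = keys.shape[1]
    scores = keys @ q / np.sqrt(d_k)          # score = <query, key> / sqrt(d_k)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                   # sum = softmax result x value

q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[1.0], [2.0], [3.0]])
out = attention_for_one_query(q, keys, values)  # a weighted mix of the values
```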
In matrices and formulas:

Input X is represented as a matrix.
- row = number of words
- column = embedding dimension
Multiplying X by separate weight matrices for query, key, and value gives Q, K, V.
- attention dimension = key vector dimension

The rest follows the verbal explanation directly in formula form: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
- softmax = row-wise softmax
- dim(V) can differ from dim(Q) and dim(K), which must match each other for the inner product.
- In practice, they’re usually all the same for convenience.
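The matrix form can be sketched like this (the sizes and random weights are illustrative assumptions, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d_model, d_k, d_v = 3, 4, 4, 4     # 3 words, embedding dimension 4
X = rng.normal(size=(n, d_model))     # row = word, column = embedding dimension

# separate weight matrices for query, key, and value
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) attention scores
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
Z = weights @ V                                 # (n, d_v) encoded vectors
```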
Transformer characteristics
MLPs and CNNs produce fixed outputs for fixed inputs.
But in a Transformer, even if one input element is fixed, different surrounding inputs can change its output. This means it can represent far more things, but it also requires more computation (attention cost grows quadratically with sequence length), so input length can't grow without bound.
MHA (Multi-Head Attention)
Instead of a single set, multiple query, key, and value sets are created for each input.

Applying n attention heads to one input yields n outputs.
The key challenge is matching input and output dimensions. This is solved by concatenating results and multiplying by a matrix that projects back to the input dimension.
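The concat-and-project trick can be sketched as follows (head count and dimensions are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_model, num_heads = 3, 8, 2
d_head = d_model // num_heads      # each head works in a smaller subspace

X = rng.normal(size=(n, d_model))

head_outputs = []
for _ in range(num_heads):
    # each head has its own query/key/value weight matrices
    W_Q = rng.normal(size=(d_model, d_head))
    W_K = rng.normal(size=(d_model, d_head))
    W_V = rng.normal(size=(d_model, d_head))
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)           # row-wise softmax
    head_outputs.append(w @ V)                  # (n, d_head) per head

# concatenate all head outputs, then project back to the input dimension
concat = np.concatenate(head_outputs, axis=1)   # (n, num_heads * d_head)
W_O = rng.normal(size=(num_heads * d_head, d_model))
out = concat @ W_O                              # (n, d_model): dimension restored
```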

Summarizing this entire process:
ref: https://jalammar.github.io/illustrated-transformer/
In theory, the diagram above is all you need, but actual implementation differs. For instance, if input X is 100-dimensional, it might be split into 10 parts. I’ll explain in the practical post.
Positional encoding
A value is added to the input, similar to a bias. This is needed because position-dependent variation is important. Without positional encoding, reordering a sentence would be undetectable. So positional encoding captures ordering information.
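The sinusoidal encoding from the original paper can be sketched as:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding (d_model assumed even here).

    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(seq_len=10, d_model=8)
# added to the input embeddings: X = embeddings + pe
```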
Encoder Overview

Information flow between Encoder and Decoder

GIF showing encoder information moving to the decoder.
- The encoder sends key and value to the decoder.
- Attention is computed by taking the inner product of a query with the keys of the other words, applying softmax, and multiplying by the values. So to build the attention map, the decoder needs the encoder's keys and values.
- The query is not sent, because the decoder creates its own queries from its own input.
- Since the decoder layers are stacked, the top of the stack produces the words. (?)
- Output sentence is generated autoregressively.
Decoder
Self-attention
Before the softmax step, a mask is applied to future positions. Training the decoder while letting it see the future would be meaningless, so each position can attend only to the preceding positions.
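The masking step can be sketched as: future positions get -inf before the softmax, so their attention weights become zero afterward (the all-ones scores are a placeholder for real attention scores).

```python
import numpy as np

n = 4
scores = np.ones((n, n))                          # placeholder attention scores

# upper-triangular mask: position i may only attend to positions <= i
mask = np.triu(np.ones((n, n), dtype=bool), k=1)
scores[mask] = -np.inf                            # -inf -> weight 0 after softmax

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
# row 0 attends only to word 0; row 3 attends to words 0..3
```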
Encoder-Decoder attention
As mentioned above. The “Encoder-Decoder Attention” layer works like MHA, except: the query comes from the previous layer’s output matrix, and key and value come from the encoder stack.
Final layer

Calling it “final layer” for convenience. The decoder stack’s output is passed through a linear layer and a softmax to produce a probability distribution over words.
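This step is typically a linear projection followed by a softmax over the vocabulary (a sketch; the vocabulary size and random weights are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, vocab_size = 8, 100                     # illustrative sizes

decoder_output = rng.normal(size=(d_model,))     # one decoder position's vector
W_vocab = rng.normal(size=(d_model, vocab_size)) # projection to vocabulary logits

logits = decoder_output @ W_vocab
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # distribution over words
next_word = int(np.argmax(probs))                # greedy pick of the next word
```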
Vision Transformer
The original Transformer paper was for machine translation, but it’s been adopted in CV too.
Images are split into patches, go through word-like embedding, and then through the Transformer.
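Splitting an image into patch "words" can be sketched like this (the image and patch sizes are illustrative assumptions):

```python
import numpy as np

H = W = 32; P = 8; C = 3                 # 32x32 RGB image, 8x8 patches
img = np.zeros((H, W, C))

# cut the image into (H/P) * (W/P) patches, each flattened to a vector
patches = img.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
# patches: (16, 192) -- 16 "words", each a 192-dim vector;
# a linear projection then maps each patch to the model dimension
```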
DALL-E

A model that generates images from text. It is based on GPT-3.