
Introduction to the Transformer

September 13, 2021
6 min read

Before the ‘Attention Is All You Need’ (2017) paper, attention was just an add-on to LSTMs and GRUs. That paper removed the RNN entirely and proposed a new Seq2Seq model based on attention alone.

Existing RNNs

![](/assets/images/Transformer 도입/6db7e06a-32fe-4da2-9829-fe67bf7b4796-image.png)

The sequence is fed in token by token, and at each step a hidden state is encoded and emitted. Attention was also used to weight information from previous steps, determining how much prior information to use at the current step.

But because an RNN must process the sequence step by step, with runtime proportional to sequence length, fundamental problems like long-term dependencies and vanishing/exploding gradients couldn’t be fully solved.

Bi-Directional RNNs

![](/assets/images/Transformer 도입/c77a3d16-d4ae-4354-8c2e-d1813138d51b-image.png) Existing RNNs can’t use information about future steps at earlier steps. In the Forward RNN above, ‘I’ can’t access information about ‘go’ or ‘home’. But contexts that require later-occurring information clearly exist, so Backward RNNs were considered.

The idea is to learn in parallel with the reversed sequence. In the example above, for ‘go’ you get two hidden states: $h_2^f$ and $h_2^b$. Simply concatenating them gives a new hidden state twice the original size.
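A minimal numpy sketch of that concatenation, with hypothetical sizes and random values standing in for real hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden states for 'go' (step 2), each of size 4:
# one from the forward RNN, one from the backward RNN.
h2_f = rng.standard_normal(4)  # forward hidden state h_2^f
h2_b = rng.standard_normal(4)  # backward hidden state h_2^b

# Concatenation yields a new hidden state twice the original size.
h2 = np.concatenate([h2_f, h2_b])
print(h2.shape)  # (8,)
```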

Transformer

Self-attention

![](/assets/images/Transformer 도입/d2a126d4-f92e-494d-9566-8ed18f4eaad6-image.png) (Figure 1)

Suppose input vectors exist as shown above. The Transformer creates encoding vectors that reflect the sequence contents well, in the same format as the input vectors.

If we view this figure as a decoder, then I, go, home are inputs occurring as hidden states at specific time steps. Simultaneously, I, go, home can be viewed as the set of encoder hidden state vectors.

![](/assets/images/Transformer 도입/1c87ed9e-bc78-4f1c-a9ca-87c34dd1379f-image.png) In the previous attention post, similarity was computed via inner product between encoder hidden states and decoder hidden state vectors.

Performing this process within a single vector set is why (Figure 1) is called a self-attention module.

Problem

Since similarity is measured within a single vector set, the similarity of a vector with itself will be much higher than with others. The output vector will just contain self-information.

What we want is meaningful sequence interpretation (like in Seq2Seq), not self-replication.

Solution

To address this, three vectors are devised:

  • Query: determines which vectors from the set to selectively attend to.
  • Key: the vector whose inner product with the Query yields the similarity.
  • Value: the vector to which the query-key similarity weights are applied.

Self-attention structure

The input vector set is transformed into query, key, and value vectors through separate linear transformation matrices $W^Q$, $W^K$, $W^V$.

Encoding process

Suppose we encode ‘I’.

  1. Q, K, V computation Create the Query vector via $W^Q$, the Key vector via $W^K$, and the Value vector via $W^V$. The similarity computed with $k_n$ is applied to its 1-to-1 paired value $v_n$.

  2. Compute attention weights Compute the similarity between Q and K and apply softmax. As in (Figure 1), self-similarity turns out lower than one might naively expect, so we get an objective similarity that isn’t dominated by a vector’s similarity with itself.

  3. Apply attention weights to value vectors We obtain the desired encoding vector for ‘I’.

  4. Repeat for the next sequence element Only the Query is newly computed via $W^Q$; the Key and Value reuse previously computed results.
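The steps above can be sketched in numpy; the dimensions and random inputs are hypothetical, and $d_v = d_k$ here for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3  # hypothetical sizes

# Input vectors for 'I', 'go', 'home' (one row each).
X = rng.standard_normal((3, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# 1. Q, K, V computation (query taken for 'I' only).
q = X[0] @ W_Q   # query vector for 'I'
K = X @ W_K      # one key per token; k_n pairs 1-to-1 with v_n
V = X @ W_V      # one value per token

# 2. Attention weights: inner products with every key, then softmax.
scores = K @ q
weights = np.exp(scores) / np.exp(scores).sum()

# 3. Weighted sum of the value vectors: the encoding vector for 'I'.
encoding = weights @ V
print(encoding.shape)  # (3,)
```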

Parallelization with matrices

![](/assets/images/Transformer 도입/4c2a7006-2f55-4aba-bfa9-bd71c4693eea-image.png)

Consider the sentence ‘Thinking Machines’. To perform self-attention, we need Q, K, V. We could compute Q, K, V for ‘Thinking’ ($X_1$) first, compute its output vector, then do the same for $X_2$.

But let’s parallelize with matrices.

Let matrix $X$ be the row-wise concatenation of $X_1$ and $X_2$. The matrix product of $X$ with $W^Q$ gives $Q$, which equals the concatenation of $Q_1$ and $Q_2$. The same applies for $K$ and $V$.
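A small numpy check of this equivalence (hypothetical sizes, random values):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 3  # hypothetical sizes

X1 = rng.standard_normal(d_model)  # 'Thinking'
X2 = rng.standard_normal(d_model)  # 'Machines'
W_Q = rng.standard_normal((d_model, d_k))

# Per-token computation...
Q1, Q2 = X1 @ W_Q, X2 @ W_Q

# ...equals a single matrix product on the stacked inputs.
X = np.stack([X1, X2])  # rows of X are X1 and X2
Q = X @ W_Q

print(np.allclose(Q, np.stack([Q1, Q2])))  # True
```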

Improving on RNN limitations

The Transformer produces encoding vectors by converting all sequence contents into Q, K, V, regardless of sequence length or order. It has broken free from the fundamental RNN structure where the hidden state struggles to carry earlier information as sequences grow longer, substantially improving the long-term dependency problem.

Scaled Dot-Product Attention

  • Input: a query $q$ and a set of key-value pairs $(k, v)$
  • $q$, $k$, $v$, and the output are all vectors!
  • Output: weighted sum of the values
  • Weight on each value: the inner product of the query with the corresponding key
  • Queries and keys have the same dimensionality $d_k$.
  • Values have dimensionality $d_v$.
  • $d_k$ and $d_v$ don’t have to be equal!
    • Applying a weight to a value is a scalar multiplication, independent of the number of dimensions.
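A minimal sketch with deliberately different $d_k$ and $d_v$, showing why they need not match (all sizes and inputs are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_k, d_v = 3, 4, 6  # key/query dim and value dim differ on purpose

q = rng.standard_normal(d_k)       # one query
K = rng.standard_normal((n, d_k))  # n keys, same dimensionality as q
V = rng.standard_normal((n, d_v))  # n values, dimensionality d_v != d_k

scores = K @ q                                   # query-key inner products
weights = np.exp(scores) / np.exp(scores).sum()  # normalized via softmax
out = weights @ V                                # weighted sum of values

print(out.shape)  # (6,) -- the output lives in the value space
```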

Generalization

![](/assets/images/Transformer 도입/d2839f00-d508-44d2-b13d-a0c73b9f3e4c-image.png)

  • i: index of the input vector (which query we are encoding)
  • j: index running over all keys whose inner products with $q_i$ are computed

![](/assets/images/Transformer 도입/eadc7d76-82cd-4c03-9371-6fbae7532ad8-image.png)

The first formula shows the self-attention result for a single query. Let’s extend this to a Query matrix, where each row is a query.

![](/assets/images/Transformer 도입/a096577f-69ad-431b-bad1-37c4346ebad8-image.png)

  1. The product of $Q$ and $K^T$ creates a matrix whose rows are the pre-softmax vectors of all query-key inner products.

  2. Apply softmax to the Q-K product result. When softmax is applied to a matrix, it’s row-wise softmax by default. Each row becomes a weight vector summing to 1.

  3. Multiply the result of step 2 by the value matrix. This realizes the intended weighting: each output row is the weighted sum of the value rows, with the weights taken from the corresponding row of step 2’s result.
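The three matrix steps can be sketched as follows (hypothetical sizes, random inputs):

```python
import numpy as np

def row_softmax(S):
    # Row-wise softmax: each row becomes a weight vector summing to 1.
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d_k, d_v = 3, 4, 4  # hypothetical sizes
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))

S = Q @ K.T         # step 1: all query-key inner products, one query per row
A = row_softmax(S)  # step 2: each row of weights sums to 1
out = A @ V         # step 3: each output row is a weighted sum of value rows

print(A.sum(axis=1))  # each row of weights sums to 1
```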

This was confusing because the computation order differs slightly from the earlier conceptual diagrams.

![](/assets/images/Transformer 도입/b67e0323-8b47-4c86-bc61-ffc314f4a5ac-image.png)

Here, the weight 0.2 is scalar-multiplied into $v_1$, 0.1 into $v_2$, and 0.7 into $v_3$, producing scaled copies of $v_1, v_2, v_3$. These three vectors are then simply added to form the final output vector.

![](/assets/images/Transformer 도입/37b48225-c5a8-4330-b896-6d7e9badb784-image.png)

In the actual formula, the weights for all queries are already computed and stored row by row; the values are likewise stored row by row. Multiplying the two matrices therefore performs, for every row at once, the weight multiplication and summation described above.

In short, the per-token conceptual explanation is carried out in parallel via matrix operations.

Transformer summary

![](/assets/images/Transformer 도입/3d4dbbf7-ce2a-443f-a640-78a9db4d9d03-image.png) Only the scaling part is different; this summarizes everything explained above.

Simplified computation order:

  1. Q, K, V computation
  2. Perform attention computation per the formula above
  3. Obtain encoding vector!

As a diagram: ![](/assets/images/Transformer 도입/abe2b34e-7c3a-41d0-84a1-bfcec1de9fd3-image.png)
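Putting the summary together, a compact sketch of scaled dot-product attention (the sizes and weight matrices are hypothetical):

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)
    e = np.exp(S - S.max(axis=1, keepdims=True))
    return (e / e.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))  # 'I', 'go', 'home' as rows
W_Q, W_K, W_V = (rng.standard_normal((4, 4)) for _ in range(3))

# 1. Q, K, V computation  2. attention per the formula  3. encoding vectors
Z = attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape)  # (3, 4)
```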

Why scale by $\sqrt{d_k}$

Among peers, we had simply understood it as preventing gradient explosion. That’s partly valid, but Professor Jaegeol Ju’s lecture gave the rigorous answer.

Variance

Assume two vectors whose entries have mean 0 and variance 1:

$(a, b)$, $(x, y)$

Their inner product is $ax + by$ (transpose $(x, y)$ and multiply).

Mathematical facts

  • The product of two independent random variables, each with mean 0 and variance 1, also has mean 0 and variance 1.
  • The variance of the sum of independent random variables is the sum of their variances.
  • Dividing a random variable by $\sqrt{n}$ divides its variance by $n$.

So the variance of $ax + by$ is 2. Here we used 2-dimensional vectors, so the difference is small, but with 100 dimensions the variance would be 100.

If the variance is 2, the standard deviation is $\sqrt{2}$; say the inner-product results are $[1.1, -0.8, -1.7]$. If the variance jumped to 100, these might become something like $[8, -11, 7]$. (Not exact values.)

In softmax, larger standard deviation means the probability distribution concentrates on the largest values. Smaller standard deviation yields a more uniform distribution.

So as dimension size grows, the probability distribution may deviate significantly from what was intended. If certain softmax probabilities become too large or small, gradient vanishing becomes a risk.

Therefore, using the mathematical facts above, we restore the variance: dividing $ax + by$ by $\sqrt{2}$ divides its variance by 2, bringing it back to 1.

This is why the Transformer scales the query-key inner products by $\sqrt{d_k}$ (recall that $d_k = d_q$, the shared query/key dimensionality).
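The variance argument can be checked numerically; a small simulation sketch (the sample sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 100, 200_000  # arbitrary sample sizes

# Entries drawn with mean 0 and variance 1, as assumed above.
a = rng.standard_normal((trials, d_k))
b = rng.standard_normal((trials, d_k))

dots = (a * b).sum(axis=1)    # inner products; variance grows to ~d_k
scaled = dots / np.sqrt(d_k)  # dividing by sqrt(d_k) restores variance ~1

print(round(dots.var()), round(scaled.var(), 2))
```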
