SDPA
Scaled Dot-Product Attention. This is the model with only one attention head from the Transformer lesson.

As shown above, generating multiple Q, K, V sets is MHA; generating just one set is SDPA.
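The code below keeps referring to "the formula above"; for reference, this is the scaled dot-product attention formula from the Transformer paper (with d_K the key dimension):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_K}}\right)V$$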

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    """
    Structure with a single attention head.
    Input: n word vectors (each d-dimensional) from the embedding result.
    Finds query, key, value and performs the attention computation shown in the formula above.
    Output: for the n word vectors, a tensor whose last dimension equals the value vector dimension.
    """
    def forward(self, Q, K, V, mask=None):
        d_K = K.size()[-1]  # key dimension
        scores = Q.matmul(K.transpose(-2, -1)) / np.sqrt(d_K)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention = F.softmax(scores, dim=-1)
        out = attention.matmul(V)
        return out, attention
```
```python
# Demo run of scaled dot-product attention
SDPA = ScaledDotProductAttention()
"""
n_batch: batch size
d_K: Key vector dimension
d_V: Value vector dimension
n_Q: number of Query vectors
n_K: number of Key vectors
n_V: number of Value vectors
"""
n_batch, d_K, d_V = 3, 128, 256  # d_K (= d_Q) does not necessarily equal d_V
n_Q, n_K, n_V = 30, 50, 50
Q = torch.rand(n_batch, n_Q, d_K)
K = torch.rand(n_batch, n_K, d_K)
V = torch.rand(n_batch, n_V, d_V)
out, attention = SDPA.forward(Q, K, V, mask=None)
def sh(x): return str(x.shape)[11:-1]
print("SDPA: Q%s K%s V%s => out%s attention%s" % (sh(Q), sh(K), sh(V), sh(out), sh(attention)))
```
As the formula shows, the query and key dimensions are both d_K (that is, d_Q = d_K): query and key must have the same dimension for the dot product to work.
The value dimension is d_V, which may differ from d_K, but in practice it is often set equal to the query/key dimension for convenience.
Number of Q, K, V
With encoder and decoder
Looking at the code, the SDPA vector counts are:
n_Q ≠ (n_K = n_V)
That is, n_K and n_V must match, while n_Q can differ.
V and K come from the encoder, while the decoder creates Q from its own input, so the counts can differ.
This is the more general case since it assumes encoder-decoder.
The purpose of SDPA becomes clear here! We want to encode the Query vectors by referencing the Key and Value vectors.
So the number of SDPA output vectors must match the number of Query vectors.
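A quick sketch of this, reusing the ScaledDotProductAttention module from above (the sizes are hypothetical; Q stands in for the decoder side, K and V for the encoder side), shows that the output count follows n_Q, not n_K:

```python
# Cross-attention-style shapes: n_Q differs from n_K = n_V
n_batch, d_K, d_V = 3, 128, 128
n_Q, n_KV = 30, 50                   # 30 decoder-side Queries, 50 encoder-side Keys/Values
Q = torch.rand(n_batch, n_Q, d_K)    # built by the decoder from its own input
K = torch.rand(n_batch, n_KV, d_K)   # from the encoder
V = torch.rand(n_batch, n_KV, d_V)   # from the encoder
out, attention = ScaledDotProductAttention().forward(Q, K, V)
print(out.shape)        # torch.Size([3, 30, 128]) -> one output vector per Query
print(attention.shape)  # torch.Size([3, 30, 50])  -> each Query attends over the 50 Keys
```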
For self-attention
n_Q = n_V = n_K
All must be equal.
K.transpose(-2, -1)
PyTorch tensors support transpose like this. It swaps the two dimensions given as arguments. Here, it swaps the last and second-to-last dimensions.
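For example (arbitrary shape):

```python
K = torch.rand(3, 50, 128)
print(K.transpose(-2, -1).shape)  # torch.Size([3, 128, 50]) -> last two dimensions swapped
```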
torch.nn.Softmax()
I didn’t know what dim = -1 meant, so I checked the docs:
dim (int) - A dimension along which Softmax will be computed (so every slice along dim will sum to 1).
https://stackoverflow.com/questions/49036993/pytorch-softmax-what-dimension-to-use
It computes softmax along the specified dimension.

The definition of softmax is
$$\mathrm{softmax}(x_i) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$$
and the dimension along which the sum over j runs is specified via the dim option.
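A small check of what dim=-1 does in practice (arbitrary tensor):

```python
scores = torch.rand(2, 3, 4)
attention = torch.softmax(scores, dim=-1)
print(attention.sum(dim=-1))  # every slice along the last dimension sums to 1 -> a [2, 3] tensor of ones
```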
Why SDPA code also works for MHA
The instructor said it's because the multiplication is batched. I'm not entirely sure what was meant.
My take: since SDPA is implemented purely with matrix operations, it works no matter how many leading dimensions Q, K, V have. matmul treats every dimension except the last two as batch dimensions, so only the last two dimensions need to line up.
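To check, we can feed 4-D tensors (a hypothetical per-head layout of [n_batch, n_head, n_seq, d_head]) through the same ScaledDotProductAttention module:

```python
n_batch, n_head, n_Q, n_K, d_head = 3, 5, 30, 50, 40
Q = torch.rand(n_batch, n_head, n_Q, d_head)
K = torch.rand(n_batch, n_head, n_K, d_head)
V = torch.rand(n_batch, n_head, n_K, d_head)
out, attention = ScaledDotProductAttention().forward(Q, K, V)
print(out.shape)        # torch.Size([3, 5, 30, 40]) -> batched over n_batch and n_head
print(attention.shape)  # torch.Size([3, 5, 30, 50])
```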
MHA (Multi-Head Attention)
```python
class MultiHeadedAttention(nn.Module):
    def __init__(self, d_feat=128, n_head=5, actv=F.relu, USE_BIAS=True, dropout_p=0.1, device=None):
        """
        :param d_feat: feature dimension
        :param n_head: number of heads
        :param actv: activation after each linear layer
        :param USE_BIAS: whether to use bias
        :param dropout_p: dropout rate
        :param device: which device to use (e.g., cuda:0)
        """
        super(MultiHeadedAttention, self).__init__()
        if (d_feat % n_head) != 0:
            raise ValueError("d_feat(%d) should be divisible by n_head(%d)" % (d_feat, n_head))
        self.d_feat = d_feat
        self.n_head = n_head
        self.d_head = self.d_feat // self.n_head
        self.actv = actv
        self.USE_BIAS = USE_BIAS
        self.dropout_p = dropout_p  # prob. of zeroed

        self.lin_Q = nn.Linear(self.d_feat, self.d_feat, self.USE_BIAS)
        self.lin_K = nn.Linear(self.d_feat, self.d_feat, self.USE_BIAS)
        self.lin_V = nn.Linear(self.d_feat, self.d_feat, self.USE_BIAS)
        self.lin_O = nn.Linear(self.d_feat, self.d_feat, self.USE_BIAS)

        self.dropout = nn.Dropout(p=self.dropout_p)

    def forward(self, Q, K, V, mask=None):
        """
        :param Q: [n_batch, n_Q, d_feat]
        :param K: [n_batch, n_K, d_feat]
        :param V: [n_batch, n_V, d_feat]  <= n_K and n_V must be the same
        :param mask:
        """
        n_batch = Q.shape[0]
        Q_feat = self.lin_Q(Q)
        K_feat = self.lin_K(K)
        V_feat = self.lin_V(V)
        # Q_feat: [n_batch, n_Q, d_feat]
        # K_feat: [n_batch, n_K, d_feat]
        # V_feat: [n_batch, n_V, d_feat]

        # Multi-head split of Q, K, and V (d_feat = n_head*d_head)
        # Split Q, K, V. For example, (100,) becomes (10, 10).
        # Here, d_feat is split into n_head parts of d_head dimension.
        Q_split = Q_feat.view(n_batch, -1, self.n_head, self.d_head).permute(0, 2, 1, 3)
        K_split = K_feat.view(n_batch, -1, self.n_head, self.d_head).permute(0, 2, 1, 3)
        V_split = V_feat.view(n_batch, -1, self.n_head, self.d_head).permute(0, 2, 1, 3)
        # Q_split: [n_batch, n_head, n_Q, d_head]
        # K_split: [n_batch, n_head, n_K, d_head]
        # V_split: [n_batch, n_head, n_V, d_head]

        # Multi-Headed Attention
        d_K = K.size()[-1]  # key dimension (note: this is the un-split d_feat, not d_head)
        scores = torch.matmul(Q_split, K_split.permute(0, 1, 3, 2)) / np.sqrt(d_K)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention = torch.softmax(scores, dim=-1)
        x_raw = torch.matmul(self.dropout(attention), V_split)  # dropout is NOT mentioned in the paper
        # attention: [n_batch, n_head, n_Q, n_K]
        # x_raw: [n_batch, n_head, n_Q, d_head]

        # Reshape x
        x_rsh1 = x_raw.permute(0, 2, 1, 3).contiguous()
        # x_rsh1: [n_batch, n_Q, n_head, d_head]
        # Merge the tensor that was split into n_head parts of d_head dimension.
        # n_head * d_head = d_feat, so we use d_feat directly.
        x_rsh2 = x_rsh1.view(n_batch, -1, self.d_feat)
        # x_rsh2: [n_batch, n_Q, d_feat]

        # Linear
        x = self.lin_O(x_rsh2)
        # x: [n_batch, n_Q, d_feat]
        out = {'Q_feat': Q_feat, 'K_feat': K_feat, 'V_feat': V_feat,
               'Q_split': Q_split, 'K_split': K_split, 'V_split': V_split,
               'scores': scores, 'attention': attention,
               'x_raw': x_raw, 'x_rsh1': x_rsh1, 'x_rsh2': x_rsh2, 'x': x}
        return out
```
```python
# Self-Attention Layer
"""
n_batch: 128 sequences per batch from the training data
n_src: n_src words go in = process n_src sequence elements at once
d_feat: feature dimension
n_head: how many heads for multi-head attention
"""
n_batch = 128
n_src = 32
d_feat = 200
n_head = 5
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # defined here so the snippet runs standalone
src = torch.rand(n_batch, n_src, d_feat)
self_attention = MultiHeadedAttention(
    d_feat=d_feat, n_head=n_head, actv=F.relu, USE_BIAS=True, dropout_p=0.1, device=device)

# Since it's self-attention, Q, K, V all have the same dimension (here they are all src)
out = self_attention.forward(src, src, src, mask=None)

Q_feat, K_feat, V_feat = out['Q_feat'], out['K_feat'], out['V_feat']
Q_split, K_split, V_split = out['Q_split'], out['K_split'], out['V_split']
scores, attention = out['scores'], out['attention']
x_raw, x_rsh1, x_rsh2, x = out['x_raw'], out['x_rsh1'], out['x_rsh2'], out['x']
```
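As a quick sanity check (reusing the sh helper defined in the SDPA demo above), the shapes match the annotations in the code:

```python
print("src%s => x%s attention%s" % (sh(src), sh(x), sh(attention)))
# src[128, 32, 200] => x[128, 32, 200] attention[128, 5, 32, 32]
```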
- The paper does not mention applying dropout to the attention weights. But in practice, dropout is used in attention layers, so it is included here.
- The original MHA creates k separate heads and aggregates their results afterwards.
- The actual implementation splits the features into k parts up front and runs Scaled Dot-Product attention on all heads at once (see the sketch after this list).
- Therefore, d_feat must be divisible by n_head.
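A small sketch of that split (hypothetical sizes mirroring the view/permute in the code): dividing d_feat into n_head chunks of d_head is just a reshape, so each head sees its own contiguous d_head slice of the features.

```python
n_batch, n_src, d_feat, n_head = 128, 32, 200, 5
d_head = d_feat // n_head  # 40
Q_feat = torch.rand(n_batch, n_src, d_feat)
Q_split = Q_feat.view(n_batch, -1, n_head, d_head).permute(0, 2, 1, 3)
print(Q_split.shape)  # torch.Size([128, 5, 32, 40])
# head 0 is simply the first d_head features of every position
print(torch.equal(Q_split[:, 0], Q_feat[:, :, :d_head]))  # True
```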
torch.Tensor.permute
Same functionality as transpose. The difference is that transpose only swaps two dimensions, while permute works on all dimensions.
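For example (arbitrary shape):

```python
x = torch.rand(2, 3, 4, 5)
print(x.transpose(1, 2).shape)      # torch.Size([2, 4, 3, 5]) -> swaps exactly two dimensions
print(x.permute(0, 2, 1, 3).shape)  # torch.Size([2, 4, 3, 5]) -> same result written as a permute
print(x.permute(3, 0, 2, 1).shape)  # torch.Size([5, 2, 4, 3]) -> arbitrary reordering of all dimensions
```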
Takeaway
This can be slightly confusing, so to summarize:
- n_Q ≠ (n_K = n_V): n_K and n_V must be equal, while n_Q can differ.
- d_Q = d_K
Why #1 holds:
- Key and Value come from the encoder.
- Query is the input received by the decoder.
Why #2 holds:
- The attention scores are inner products between Query and Key, so the two need the same dimension.
- Value dimension can differ from both.