*The study notes during [[Research]]*

# The flow of Attention

A transformer is made of multiple attention layers. The input to each attention layer, denoted as $X$, has the shape `(B, L, D)`, where B is the batch size, L is the sequence length, and D is the hidden dimension.

$X$ first goes through **a layer normalization**, where the normalization is applied across the channel (or feature) dimension. $X$ is then projected into three different vectors $K, Q, V$, each with the shape `(B, L, H, D')`, **where H is the number of attention heads[^1]**, and D' is the per-head hidden dimension (typically $D' = D / H$).

The next step is to calculate the self-attention score (also called scaled dot-product attention), which has the shape `(B, H, L, L)`:

$$
\text{AttentionScore}(X)=\text{Softmax}\left(\frac{Q\times K^T}{\sqrt{D'}}\right)
$$

During training, a dropout operation can be applied to the scaled attention scores, for [example](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention):

```python
attn_weight = torch.softmax(attn_weight, dim=-1)                 # normalize scores over the key dimension
attn_weight = torch.dropout(attn_weight, dropout_p, train=True)  # randomly zero attention weights during training
```

**The output of the attention layer ==with residual connections==** is:

$$
\text{out}=\text{Feedforward}(\text{AttentionScore}(X)\times V + X) + X
$$

where the per-head results of $\text{AttentionScore}(X)\times V$ (shape `(B, H, L, D')`) are concatenated back to `(B, L, D)` before the residual addition.

Above, $\text{Feedforward}(\cdot)$ refers to a fully-connected feedforward network, which processes each position in the sequence independently. One implementation could be:

```python
# inside an nn.Module's __init__; dim = D, and hidden_dim is commonly 4 * D
self.ffn = nn.Sequential(
    nn.Linear(dim, hidden_dim),
    nn.GELU(),
    nn.Dropout(dropout),
    nn.Linear(hidden_dim, dim),
    nn.Dropout(dropout)
)
```

[^1]: Rather than computing attention only once, the multi-head mechanism runs the scaled dot-product attention multiple times in parallel.
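
To make the shapes above concrete, here is a tiny runnable check of the score formula and the dropout step; the toy sizes and tensor names are my own illustrative assumptions, not part of the note:

```python
import torch

B, H, L, Dh = 2, 8, 16, 8        # toy sizes: batch, heads, sequence length, per-head dim D'
q = torch.randn(B, H, L, Dh)     # queries, already split into heads
k = torch.randn(B, H, L, Dh)     # keys, already split into heads

# AttentionScore(X) = Softmax(Q K^T / sqrt(D')), shape (B, H, L, L)
scores = torch.softmax(q @ k.transpose(-2, -1) / Dh**0.5, dim=-1)

# the dropout step from the snippet above, applied to the normalized scores
scores = torch.dropout(scores, 0.1, train=True)

print(scores.shape)              # torch.Size([2, 8, 16, 16])
```

Multiplying `scores` by a value tensor of shape `(B, H, L, D')` then gives the per-head attention outputs used in the residual formula.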
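
Putting the whole flow together, one possible sketch of a single attention layer in PyTorch, using `F.scaled_dot_product_attention` from the linked documentation. The joint `to_qkv` projection and the output projection `to_out` that merges the heads back to D are standard choices I am assuming here; they are not spelled out in the formulas above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBlock(nn.Module):
    """One attention layer: out = Feedforward(Attention(X) + X) + X (a sketch, not a reference implementation)."""

    def __init__(self, dim, heads, hidden_dim, dropout=0.0):
        super().__init__()
        assert dim % heads == 0, "D must be divisible by H"
        self.heads = heads
        self.dropout = dropout
        self.norm = nn.LayerNorm(dim)            # normalization over the feature dimension
        self.to_qkv = nn.Linear(dim, 3 * dim)    # joint projection producing Q, K, V
        self.to_out = nn.Linear(dim, dim)        # merges the H heads back into D (assumed, not in the note's formula)
        self.ffn = nn.Sequential(                # position-wise feedforward network, as in the note
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):                        # x: (B, L, D)
        B, L, D = x.shape
        q, k, v = self.to_qkv(self.norm(x)).chunk(3, dim=-1)
        # split into heads: (B, L, D) -> (B, H, L, D')
        q, k, v = (t.reshape(B, L, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(   # softmax(QK^T / sqrt(D')) V, with dropout during training
            q, k, v, dropout_p=self.dropout if self.training else 0.0
        )
        attn = attn.transpose(1, 2).reshape(B, L, D)   # concatenate heads back to (B, L, D)
        h = x + self.to_out(attn)                # attention output + residual connection
        return h + self.ffn(h)                   # feedforward + residual connection

# usage: (B, L, D) in, (B, L, D) out
block = AttentionBlock(dim=64, heads=8, hidden_dim=256, dropout=0.1)
y = block(torch.randn(2, 16, 64))                # -> torch.Size([2, 16, 64])
```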