*Study notes taken during [[Research]]*
# The Flow of Attention
A transformer is made of multiple attention layers.
The input to each attention layer, denoted $X$, has shape `(B, L, D)`, where B is the batch size, L is the sequence length, and D is the hidden dimension.
$X$ first goes through **a layer normalization**, where the normalization is applied across the channel (or feature) dimension.
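A minimal sketch of this step, assuming illustrative values for `B`, `L`, and `D`:
```python
import torch
import torch.nn as nn

B, L, D = 2, 16, 512        # hypothetical batch size, sequence length, hidden dimension
x = torch.randn(B, L, D)    # the input X to the attention layer

norm = nn.LayerNorm(D)      # normalizes over the last (feature) dimension
x_normed = norm(x)          # shape is unchanged: (B, L, D)
```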
$X$ is then projected into three tensors $Q, K, V$, each with shape `(B, L, H, D')`, **where H is the number of attention heads[^1]** and D' is the per-head hidden dimension (typically $D' = D / H$).
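Continuing the sketch above, one common way to realize this projection is three linear layers followed by a reshape into `H` heads of size `D' = D // H` (the layer names and `H = 8` are assumptions):
```python
H = 8                            # number of attention heads (assumed)
D_head = D // H                  # per-head dimension D'

w_q = nn.Linear(D, H * D_head)   # query projection
w_k = nn.Linear(D, H * D_head)   # key projection
w_v = nn.Linear(D, H * D_head)   # value projection

# (B, L, D) -> (B, L, H, D')
q = w_q(x_normed).view(B, L, H, D_head)
k = w_k(x_normed).view(B, L, H, D_head)
v = w_v(x_normed).view(B, L, H, D_head)
```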
The next step is to compute the self-attention score (also called scaled dot-product attention), which has shape `(B, H, L, L)`:
$$
\text{AttentionScore}(X)=\text{Softmax}\left(\frac{Q\times K^T}{\sqrt{D'}}\right)
$$
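Continuing the sketch, the head dimension is moved in front of the sequence dimension so that each head computes its own `(L, L)` score matrix:
```python
import math

# (B, L, H, D') -> (B, H, L, D'), so each head attends independently
q_h = q.transpose(1, 2)
k_h = k.transpose(1, 2)
v_h = v.transpose(1, 2)

# (B, H, L, D') @ (B, H, D', L) -> (B, H, L, L)
scores = q_h @ k_h.transpose(-2, -1) / math.sqrt(D_head)
attn_weight = torch.softmax(scores, dim=-1)
```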
During training, a dropout operation can be applied to the attention weights after the softmax, for [example](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention):
```python
# here attn_weight initially holds the scaled scores Q @ K^T / sqrt(D'), shape (B, H, L, L)
attn_weight = torch.softmax(attn_weight, dim=-1)
attn_weight = torch.dropout(attn_weight, dropout_p, train=True)
```
**The output of the attention layer ==with residual connections==** is:
$$
\text{out}=\text{Feedforward}(Z) + Z, \quad \text{where } Z = \text{AttentionScore}(X)\times V + X
$$
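Continuing the sketch, the term $\text{AttentionScore}(X)\times V$ amounts to weighting the values with the attention weights and merging the heads back into the `D` dimension:
```python
# (B, H, L, L) @ (B, H, L, D') -> (B, H, L, D')
attn_out = attn_weight @ v_h

# merge the heads back: (B, H, L, D') -> (B, L, H*D') = (B, L, D)
# (many implementations apply one more linear projection here; omitted in this sketch)
attn_out = attn_out.transpose(1, 2).reshape(B, L, D)
```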
In the formula above, $\text{Feedforward}(\cdot)$ refers to a fully connected feedforward network, which processes each position in the sequence independently. One implementation could be:
```python
self.ffn = nn.Sequential(
    nn.Linear(dim, hidden_dim),   # expand to the hidden width
    nn.GELU(),
    nn.Dropout(dropout),
    nn.Linear(hidden_dim, dim),   # project back to the model dimension
    nn.Dropout(dropout)
)
```
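Putting everything together (a sketch continuing the shapes above; `ffn` here is a standalone instance of the block above with a hypothetical `hidden_dim = 4 * D` and `dropout = 0.1`):
```python
ffn = nn.Sequential(
    nn.Linear(D, 4 * D),
    nn.GELU(),
    nn.Dropout(0.1),
    nn.Linear(4 * D, D),
    nn.Dropout(0.1),
)

# residual connection around the attention output, then the feedforward with its own residual
z = attn_out + x
out = ffn(z) + z    # (B, L, D), same shape as the input X
```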
[^1]: Rather than computing attention only once, the multi-head mechanism runs the scaled dot-product attention multiple times in parallel, once per head.