azatkariuly

September 30, 2025

Attention is All You Need - From Scratch (PyTorch)

I’ve always heard the phrase “attention is all you need” floating around in the ML world, and I finally decided to see what the fuss was about.

Paper: Attention Is All You Need
The full code is available on GitHub: Link


Embeddings

In a Transformer, the first step is to convert input tokens (words, subwords, or characters) into vectors that a neural network can understand.

Raw tokens are just integers (e.g., "cat" → 42 in the vocabulary). But numbers like 42 or 105 don’t carry semantic meaning. That’s why we use an embedding layer: it maps each token ID into a dense vector of fixed dimension.

  • Similar words get similar embeddings.
  • These vectors become the input to the attention mechanism.

Following the original paper (Vaswani et al., 2017), we scale the embeddings by √(d_model) to ensure that the variance of dot products remains roughly independent of the embedding dimension, stabilizing training.
Here, d_model is the size of the embedding vector: the number of dimensions used to represent each token. For example, if d_model = 512, each token is represented as a 512-dimensional vector. This dimensionality is used consistently throughout the Transformer for embeddings, queries, keys, values, and the hidden layers of the model. Scaling by √(d_model) keeps the values at a reasonable scale when computing attention.


Code: Token Embedding Layer

import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # look up each token's embedding and scale by sqrt(d_model), as in Vaswani et al. (2017)
        return self.embedding(x) * (self.d_model ** 0.5)
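
A quick sanity check of the layer (the vocabulary size and token IDs below are made up for illustration):

embed = InputEmbedding(d_model=512, vocab_size=10000)

tokens = torch.tensor([[42, 7, 105]])  # a batch of one sequence with 3 token IDs
vectors = embed(tokens)                # each ID becomes a scaled 512-dimensional vector
print(vectors.shape)                   # torch.Size([1, 3, 512])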

Positional Encoding

In a Transformer, the model has no inherent sense of word order, because attention treats the input as a set of tokens.
To give the model information about the position of each token in a sequence, we use positional encoding.

Positional encoding adds a vector to each token embedding that represents its position in the sequence.
This allows the model to distinguish "cat sat" from "sat cat" even though the token embeddings themselves are the same.


Sinusoidal Positional Encoding

The original Transformer paper (Vaswani et al., 2017) uses sinusoidal functions:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

  • pos = token position in the sequence
  • i = dimension index
  • d_model = size of the embedding vector

Because the encoding is a fixed function of position, the authors hypothesize that it may allow the model to extrapolate to sequences longer than those seen during training.
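
As a quick illustration (purely for intuition, not part of the model), we can evaluate these formulas directly for a toy d_model of 4:

import math

d_model = 4
for pos in range(3):
    row = []
    for dim in range(0, d_model, 2):             # dim = 2i, the even dimension index
        angle = pos / (10000 ** (dim / d_model))
        row += [math.sin(angle), math.cos(angle)]
    print(pos, [round(v, 4) for v in row])
# position 0 -> [0.0, 1.0, 0.0, 1.0]; later positions shift each sinusoid at a different rate.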


Code Implementation

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int, seq_len: int, dropout: float):
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # create a matrix of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1) # positions 0..seq_len-1, shape (seq_len, 1)
        # frequencies 1 / 10000^(2i/d_model), computed in log space for numerical stability
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term) # even dimensions get sine
        pe[:, 1::2] = torch.cos(position * div_term) # odd dimensions get cosine

        pe = pe.unsqueeze(0) # add a batch dimension: (1, seq_len, d_model)

        # register as a buffer: saved and moved with the module, but not a trainable parameter
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # add the fixed positional encodings for the first x.size(1) positions (no gradients needed)
        x = x + (self.pe[:, :x.size(1), :]).requires_grad_(False)
        return self.dropout(x)
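
Putting the two pieces together, here is a minimal usage sketch; the batch size, sequence length, and hyperparameter values are arbitrary:

d_model, vocab_size, max_seq_len = 512, 10000, 128

embed = InputEmbedding(d_model, vocab_size)
pos_enc = PositionalEncoding(d_model, max_seq_len, dropout=0.1)

tokens = torch.randint(0, vocab_size, (2, 20))  # batch of 2 sequences, 20 token IDs each
x = pos_enc(embed(tokens))                      # (2, 20, 512): scaled embeddings + positions
print(x.shape)                                  # torch.Size([2, 20, 512])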