Drag the red query vector Q to see how attention scores change for each key vector:
[Interactive panel "Attention Scores": for the current query Q (e.g. Q = (0.62, 0.62)), the panel displays the dot product Q·K and the resulting attention weight for each key K(水), K(風), K(有).]
Attention Mechanism Formula
$$\text{Attention}(Q, K) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \approx \text{softmax}(Q \cdot K^T)$$
Current calculation:
$$\begin{align}
&Q \cdot K_{\text{水}} = 0.000, \quad Q \cdot K_{\text{風}} = 0.000, \quad Q \cdot K_{\text{有}} = 0.000 \\[0.5em]
&\text{Attention: } \alpha_{\text{水}} = 0.333, \quad \alpha_{\text{風}} = 0.333, \quad \alpha_{\text{有}} = 0.333
\end{align}$$
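The uniform weights above follow directly from the softmax: when all three dot products are equal (here all 0.000), each key receives exactly 1/3 of the attention. A minimal sketch of that computation (the `softmax` helper is written here for illustration; it is not part of the demo's code):

```python
import math

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# All three dot products equal -> uniform attention weights of 1/3 each.
weights = softmax([0.0, 0.0, 0.0])
print(weights)  # each weight is 1/3 ≈ 0.333
```

Because softmax depends only on differences between scores, any three equal scores (not just zeros) produce the same uniform distribution.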
How It Works
This visualization demonstrates the core attention mechanism used in transformer models:
- Key Vectors (K): Three fixed vectors labeled with Chinese characters 水 (water), 風 (wind), and 有 (have), shown in black.
- Query Vector (Q): A movable red vector that you can drag around to explore different positions.
- Dot Products: For each key vector, we compute the dot product Q · K with the query.
- Softmax: The dot products are passed through a softmax function to produce attention weights that sum to 1.
- Interpretation: Higher attention weights indicate stronger relevance between the query and that particular key. When Q is close to a key vector (small angle), the attention weight for that key increases.
Note: In the full transformer attention formula, dot products are scaled by 1/√d_k for numerical stability with high-dimensional vectors. For this 2D visualization (d_k = 2), we omit the scaling factor, as it has minimal impact on the observed behavior.
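The full pipeline (dot products, optional scaling, softmax) can be sketched for the 2D case as follows. The key coordinates below are hypothetical unit-length directions chosen for illustration; the demo's actual key vectors may differ:

```python
import math

def attention_weights(q, keys, scale=True):
    """Dot-product attention weights for one 2D query against a list of 2D keys."""
    d_k = len(q)
    # Dot product of the query with each key.
    scores = [q[0] * k[0] + q[1] * k[1] for k in keys]
    if scale:
        # Optional 1/sqrt(d_k) scaling from the full transformer formula.
        scores = [s / math.sqrt(d_k) for s in scores]
    # Softmax over the scores (max-subtraction for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical key directions standing in for K(水), K(風), K(有).
keys = [(1.0, 0.0), (0.0, 1.0), (-0.7071, -0.7071)]

# Moving Q toward the first key shifts most of the attention onto it.
print(attention_weights((0.9, 0.1), keys))
```

With `scale=False` the function matches the unscaled formula the visualization uses; flipping the flag lets you confirm that scaling changes the sharpness of the distribution but not the ordering of the weights.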
Try moving the query vector closer to different keys and observe how the attention distribution changes!