Attention Mechanism: Keys and Queries

Drag the red query vector Q to see how attention scores change for each key vector:

Attention Scores

[Interactive panel: shows the query Q = (0.62, 0.62) together with the dot product Q·K and the resulting attention weight for each key vector 水, 風, and 有; the values update live as Q is dragged.]

Attention Mechanism Formula

$$\text{Attention}(Q, K) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \approx \text{softmax}(Q \cdot K^T)$$
For example, at the initial position all three dot products are equal, so the softmax distributes attention uniformly:
$$Q \cdot K_{\text{水}} = Q \cdot K_{\text{風}} = Q \cdot K_{\text{有}} = 0 \;\Rightarrow\; \alpha_{\text{水}} = \alpha_{\text{風}} = \alpha_{\text{有}} = \tfrac{1}{3} \approx 0.333$$
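The uniform case above follows directly from the definition of softmax: equal inputs always map to equal outputs. A minimal sketch (the softmax helper and the zero scores are illustrative, not taken from the visualization's source):

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# When all three dot products are equal (here 0.0), softmax is uniform.
weights = softmax([0.0, 0.0, 0.0])
print(weights)  # each weight is 1/3
```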

How It Works

This visualization demonstrates the core attention mechanism used in transformer models:

  • Key Vectors (K): Three fixed vectors labeled with Chinese characters 水 (water), 風 (wind), and 有 (have), shown in black.
  • Query Vector (Q): A movable red vector that you can drag around to explore different positions.
  • Dot Products: For each key vector, we compute the dot product with the query, Q · K, which measures how well the two vectors align.
  • Softmax: The dot products are passed through a softmax function to produce attention weights that sum to 1.
  • Interpretation: Higher attention weights indicate stronger relevance between the query and that particular key. When Q is close to a key vector (small angle), the attention weight for that key increases.
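The steps above can be sketched end to end. The specific 2D key vectors below are assumptions chosen for illustration (the visualization does not publish its key coordinates); only the query (0.62, 0.62) comes from the panel above:

```python
import math

# Hypothetical 2D key vectors standing in for 水, 風, 有 (assumed values).
keys = {"水": (1.0, 0.0), "風": (0.0, 1.0), "有": (-0.707, 0.707)}
query = (0.62, 0.62)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Step 1: dot product of the query with each key.
scores = [dot(query, k) for k in keys.values()]
# Step 2: softmax turns the scores into weights that sum to 1.
weights = softmax(scores)
for name, w in zip(keys, weights):
    print(f"alpha_{name} = {w:.3f}")
```

With this query, 水 and 風 are at smaller angles to Q than 有, so they receive the larger attention weights.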

Note: In the full transformer attention formula, dot products are scaled by 1/√d_k for numerical stability with high-dimensional vectors, but for this 2D visualization (d_k = 2) we omit the scaling factor, as it has minimal impact on the observed behavior.
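To see why omitting the scaling is harmless at d_k = 2, we can compare scaled and unscaled softmax on a set of example dot products (the score values here are assumed for illustration):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.62, 0.62, 0.0]  # example dot products (assumed)
d_k = 2
scaled = softmax([s / math.sqrt(d_k) for s in scores])
unscaled = softmax(scores)

# Scaling by 1/sqrt(2) only slightly flattens the distribution;
# the ranking of the keys is unchanged.
print("unscaled:", [round(w, 3) for w in unscaled])
print("scaled:  ", [round(w, 3) for w in scaled])
```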

Try moving the query vector closer to different keys and observe how the attention distribution changes!