Drag the red query vector Q to see how attention scores change for each key vector:
[Interactive panel "Attention Scores": for the current query Q (e.g. Q = (0.62, 0.62)), the panel displays the dot product Q·K and the resulting attention weight for each key K(水), K(風), K(有).]
Attention Mechanism Formula
$$\text{Attention}(Q, K) = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) \approx \text{softmax}(Q \cdot K^T)$$
Current calculation:
$$\begin{align}
&Q \cdot K_{\text{水}} = 0.000, \quad Q \cdot K_{\text{風}} = 0.000, \quad Q \cdot K_{\text{有}} = 0.000 \\[0.5em]
&\text{Attention: } \alpha_{\text{水}} = 0.333, \quad \alpha_{\text{風}} = 0.333, \quad \alpha_{\text{有}} = 0.333
\end{align}$$
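The uniform weights above follow directly from the softmax: when all three dot products are equal (here all 0.000), each key receives exactly 1/3 of the attention. A minimal sketch of that computation (the `softmax` helper is written here for illustration; it is not part of the demo's code):

```python
import math

def softmax(scores):
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# All three dot products equal -> uniform attention weights of 1/3 each.
weights = softmax([0.0, 0.0, 0.0])
print(weights)  # each weight is 1/3 ≈ 0.333
```

Because softmax depends only on differences between scores, any three equal scores (not just zeros) produce the same uniform distribution.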
How It Works
This visualization demonstrates the core attention mechanism used in transformer models:
- Key Vectors (K): Three fixed vectors labeled with Chinese characters 水 (water), 風 (wind), and 有 (have), shown in black.
- Query Vector (Q): A movable red vector that you can drag around to explore different positions.
- Dot Products: For each key vector, we compute the dot product Q · K with the query.
- Softmax: The dot products are passed through a softmax function to produce attention weights that sum to 1.
- Interpretation: Higher attention weights indicate stronger relevance between the query and that particular key. When Q is close to a key vector (small angle), the attention weight for that key increases.
Note: In the full transformer attention formula, dot products are scaled by 1/√d_k for numerical stability with high-dimensional vectors. For this 2D visualization (d_k = 2), we omit the scaling factor, as it has minimal impact on the observed behavior.
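The full pipeline (dot products, optional scaling, softmax) can be sketched for the 2D case as follows. The key coordinates below are hypothetical unit-length directions chosen for illustration; the demo's actual key vectors may differ:

```python
import math

def attention_weights(q, keys, scale=True):
    """Dot-product attention weights for one 2D query against a list of 2D keys."""
    d_k = len(q)
    # Dot product of the query with each key.
    scores = [q[0] * k[0] + q[1] * k[1] for k in keys]
    if scale:
        # Optional 1/sqrt(d_k) scaling from the full transformer formula.
        scores = [s / math.sqrt(d_k) for s in scores]
    # Softmax over the scores (max-subtraction for numerical stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical key directions standing in for K(水), K(風), K(有).
keys = [(1.0, 0.0), (0.0, 1.0), (-0.7071, -0.7071)]

# Moving Q toward the first key shifts most of the attention onto it.
print(attention_weights((0.9, 0.1), keys))
```

With `scale=False` the function matches the unscaled formula the visualization uses; flipping the flag lets you confirm that scaling changes the sharpness of the distribution but not the ordering of the weights.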
Try moving the query vector closer to different keys and observe how the attention distribution changes!