From 01bbc09565ecea80f35be45dadd549c234bd232d Mon Sep 17 00:00:00 2001
From: damoebe <100862528+damoebe@users.noreply.github.com>
Date: Sat, 28 Feb 2026 14:09:51 +0100
Subject: [PATCH 1/2] Fix grammatical errors + better generative Transformer description

---
 src/components/article/Article.svelte | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/components/article/Article.svelte b/src/components/article/Article.svelte
index 607b84c..ee624a2 100644
--- a/src/components/article/Article.svelte
+++ b/src/components/article/Article.svelte
@@ -9,16 +9,16 @@

What is a Transformer?

- Transformer is a neural network architecture that has fundamentally changed the approach to
- Artificial Intelligence. Transformer was first introduced in the seminal paper
+ The Transformer is a neural network architecture that has fundamentally changed the approach to
+ Artificial Intelligence. Transformers were first introduced in the seminal paper
  "Attention is All You Need"
- in 2017 and has since become the go-to architecture for deep learning models, powering text-generative
+ in 2017 and have since become the go-to architecture for deep learning models, powering text-generative
  models like OpenAI's GPT, Meta's Llama, and Google's
- Gemini. Beyond text, Transformer is also applied in
+ Gemini. Beyond text, Transformers are also applied in
  game playing, demonstrating its versatility across numerous domains.
+ >, demonstrating their versatility across numerous domains.

- Fundamentally, text-generative Transformer models operate on the principle of next-token prediction: given a text prompt from the user, what is the most probable next token (a word or part of a word) that will follow this input? The core

From 7c7af992d32866cb6c2a282a0acc4053912f2fee Mon Sep 17 00:00:00 2001
From: damoebe <100862528+damoebe@users.noreply.github.com>
Date: Sat, 28 Feb 2026 15:58:38 +0100
Subject: [PATCH 2/2] Enhance attention score explanation in Article.svelte

Clarify the explanation of the scaling and masking process in attention scores,
including the use of padding-masks.
---
 src/components/article/Article.svelte | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/components/article/Article.svelte b/src/components/article/Article.svelte
index ee624a2..55b6cfc 100644
--- a/src/components/article/Article.svelte
+++ b/src/components/article/Article.svelte
@@ -311,10 +311,11 @@
  between all input tokens.

  •
- Scaling · Mask: The attention scores are scaled and a mask is applied to
+ Scaling · Mask: The attention scores are scaled and a (causal) mask is applied to
  the upper triangle of the attention matrix to prevent the model from accessing future tokens, setting these values to negative infinity. The model needs to learn how to predict
- the next token without “peeking” into the future.
+ the next token without “peeking” into the future. GPTs often use a padding-mask in addition to the
+ decoder causal mask to ignore padding tokens.
  • Softmax · Dropout: After masking and scaling, the attention scores are