From 01bbc09565ecea80f35be45dadd549c234bd232d Mon Sep 17 00:00:00 2001
From: damoebe <100862528+damoebe@users.noreply.github.com>
Date: Sat, 28 Feb 2026 14:09:51 +0100
Subject: [PATCH 1/2] Fix grammatical errors + better generative Transformer
 description

---
 src/components/article/Article.svelte | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/src/components/article/Article.svelte b/src/components/article/Article.svelte
index 607b84c..ee624a2 100644
--- a/src/components/article/Article.svelte
+++ b/src/components/article/Article.svelte
@@ -9,16 +9,16 @@
 		<h1>What is a Transformer?</h1>
 
 		<p>
-			Transformer is a neural network architecture that has fundamentally changed the approach to
-			Artificial Intelligence. Transformer was first introduced in the seminal paper
+			The Transformer is a neural network architecture that has fundamentally changed the approach to
+			Artificial Intelligence. Transformers were first introduced in the seminal paper
 			<a
 				href="https://dl.acm.org/doi/10.5555/3295222.3295349"
 				title="ACM Digital Library"
 				target="_blank">"Attention is All You Need"</a
 			>
-			in 2017 and has since become the go-to architecture for deep learning models, powering text-generative
+			in 2017 and have since become the go-to architecture for deep learning models, powering text-generative
 			models like OpenAI's <strong>GPT</strong>, Meta's <strong>Llama</strong>, and Google's
-			<strong>Gemini</strong>. Beyond text, Transformer is also applied in
+			<strong>Gemini</strong>. Beyond text, Transformers are also applied in
 			<a
 				href="https://huggingface.co/learn/audio-course/en/chapter3/introduction"
 				title="Hugging Face"
@@ -36,10 +36,10 @@
 				href="https://www.deeplearning.ai/the-batch/reinforcement-learning-plus-transformers-equals-efficiency/"
 				title="Deep Learning AI"
 				target="_blank">game playing</a
-			>, demonstrating its versatility across numerous domains.
+			>, demonstrating their versatility across numerous domains.
 		</p>
 		<p>
-			Fundamentally, text-generative Transformer models operate on the principle of <strong
+			Fundamentally, generative Transformer models such as GPT are based on the decoder-only Transformer architecture and operate on the principle of <strong
 				>next-token prediction</strong
 			>: given a text prompt from the user, what is the
 			<em>most probable next token (a word or part of a word)</em> that will follow this input? The core

From 7c7af992d32866cb6c2a282a0acc4053912f2fee Mon Sep 17 00:00:00 2001
From: damoebe <100862528+damoebe@users.noreply.github.com>
Date: Sat, 28 Feb 2026 15:58:38 +0100
Subject: [PATCH 2/2] Enhance attention score explanation in Article.svelte

Clarify the explanation of how attention scores are scaled and masked, including the use of padding masks.
---
 src/components/article/Article.svelte | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/components/article/Article.svelte b/src/components/article/Article.svelte
index ee624a2..55b6cfc 100644
--- a/src/components/article/Article.svelte
+++ b/src/components/article/Article.svelte
@@ -311,10 +311,11 @@
 					between all input tokens.
 				</li>
 				<li>
-					<strong>Scaling · Mask</strong>: The attention scores are scaled and a mask is applied to
+					<strong>Scaling · Mask</strong>: The attention scores are scaled and a causal mask is applied to
 					the upper triangle of the attention matrix to prevent the model from accessing future
 					tokens, setting these values to negative infinity. The model needs to learn how to predict
-					the next token without “peeking” into the future.
+					the next token without “peeking” into the future. When sequences of different lengths are batched,
+					GPT-style models often combine this causal mask with a padding mask so that padding tokens are ignored.
 				</li>
 				<li>
 					<strong>Softmax · Dropout</strong>: After masking and scaling, the attention scores are