diff --git a/src/components/article/Article.svelte b/src/components/article/Article.svelte index 607b84c..55b6cfc 100644 --- a/src/components/article/Article.svelte +++ b/src/components/article/Article.svelte @@ -9,16 +9,16 @@

What is a Transformer?

- Transformer is a neural network architecture that has fundamentally changed the approach to - Artificial Intelligence. Transformer was first introduced in the seminal paper + The Transformer is a neural network architecture that has fundamentally changed the approach to + Artificial Intelligence. Transformers were first introduced in the seminal paper "Attention is All You Need" - in 2017 and has since become the go-to architecture for deep learning models, powering text-generative + in 2017 and have since become the go-to architecture for deep learning models, powering text-generative models like OpenAI's GPT, Meta's Llama, and Google's - Gemini. Beyond text, Transformer is also applied in + Gemini. Beyond text, Transformers are also applied in game playing, demonstrating its versatility across numerous domains. + >, demonstrating their versatility across numerous domains.

- Fundamentally, text-generative Transformer models operate on the principle of next-token prediction: given a text prompt from the user, what is the most probable next token (a word or part of a word) that will follow this input? The core @@ -311,10 +311,11 @@ between all input tokens.

  • - Scaling · Mask: The attention scores are scaled and a mask is applied to + Scaling · Mask: The attention scores are scaled and a (causal) mask is applied to the upper triangle of the attention matrix to prevent the model from accessing future tokens, setting these values to negative infinity. The model needs to learn how to predict - the next token without “peeking” into the future. + the next token without “peeking” into the future. GPTs often use a padding-mask in addition to the + decoder causal mask to ignore padding tokens.
  • Softmax · Dropout: After masking and scaling, the attention scores are