Skip to content

RunnerWay-KDT/Transformer

Repository files navigation

Transformer: Attention Is All You Need

"Attention Is All You Need" ๋…ผ๋ฌธ์„ ๋ฐ”ํƒ•์œผ๋กœ Transformer ๋ชจ๋ธ์„ ์ฒ˜์Œ๋ถ€ํ„ฐ ๊ตฌํ˜„ํ•œ ํ”„๋กœ์ ํŠธ์ž…๋‹ˆ๋‹ค.

๐Ÿ“‹ ๋ชฉ์ฐจ


๐Ÿ“„ ๋…ผ๋ฌธ ์š”์•ฝ

Attention Is All You Need (2017)

์ €์ž: Vaswani et al. (Google Brain & Google Research)

ํ•ต์‹ฌ ๋‚ด์šฉ

Transformer๋Š” ๊ธฐ์กด์˜ RNN์ด๋‚˜ CNN์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ  Self-Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜๋งŒ์œผ๋กœ ์‹œํ€€์Šค ๋ฐ์ดํ„ฐ๋ฅผ ์ฒ˜๋ฆฌํ•˜๋Š” ํ˜์‹ ์ ์ธ ์•„ํ‚คํ…์ฒ˜์ž…๋‹ˆ๋‹ค.

์ฃผ์š” ํŠน์ง•

  1. Self-Attention Mechanism: ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ๋ชจ๋“  ์œ„์น˜ ๊ฐ„ ๊ด€๊ณ„๋ฅผ ๋™์‹œ์— ๊ณ„์‚ฐ
  2. Positional Encoding: ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ช…์‹œ์ ์œผ๋กœ ์ธ์ฝ”๋”ฉ
  3. Multi-Head Attention: ์—ฌ๋Ÿฌ ๊ฐœ์˜ attention head๋กœ ๋‹ค์–‘ํ•œ ๊ด€์ ์—์„œ ์ •๋ณด ํฌ์ฐฉ
  4. Encoder-Decoder ๊ตฌ์กฐ: ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•œ ํšจ์œจ์ ์ธ ์„ค๊ณ„

๊ธฐ์ˆ ์  ํ˜์‹ 

  • ๋ณ‘๋ ฌํ™”: RNN๊ณผ ๋‹ฌ๋ฆฌ ์ˆœ์ฐจ ์ฒ˜๋ฆฌ๊ฐ€ ํ•„์š” ์—†์–ด ํ•™์Šต ์†๋„ ํ–ฅ์ƒ
  • Long-range Dependencies: ๊ธด ๊ฑฐ๋ฆฌ์˜ ์˜์กด์„ฑ๋„ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šต
  • ํ™•์žฅ์„ฑ: ๋‹ค์–‘ํ•œ NLP ํƒœ์Šคํฌ์— ์ ์šฉ ๊ฐ€๋Šฅ

์ž์„ธํ•œ ๋…ผ๋ฌธ ์š”์•ฝ: 1.Attention_Is_All_You_Need.md


๐Ÿ”ง ๊ตฌํ˜„ ๋‚ด์šฉ

๊ตฌํ˜„๋œ ํ•ต์‹ฌ ์ปดํฌ๋„ŒํŠธ

1. Scaled Dot-Product Attention

Attention(Q, K, V) = softmax(QK^T / โˆšd_k)V
  • Query, Key, Value๋ฅผ ์ด์šฉํ•œ attention score ๊ณ„์‚ฐ
  • Scaling factor (โˆšd_k)๋กœ gradient ์•ˆ์ •ํ™”

2. Multi-Head Attention

MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
  • 8๊ฐœ์˜ parallel attention layer
  • ๊ฐ head๋Š” ๋‹ค๋ฅธ representation subspace ํ•™์Šต

3. Position-wise Feed-Forward Networks

FFN(x) = max(0, xW_1 + b_1)W_2 + b_2
  • 2๊ฐœ์˜ linear transformation๊ณผ ReLU activation
  • ๊ฐ ์œ„์น˜๋งˆ๋‹ค ๋…๋ฆฝ์ ์œผ๋กœ ์ ์šฉ

4. Positional Encoding

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • Sinusoidal function์„ ์‚ฌ์šฉํ•œ ์œ„์น˜ ์ •๋ณด ์ธ์ฝ”๋”ฉ
  • ํ•™์Šต ์—†์ด ๊ณ ์ •๋œ ๊ฐ’ ์‚ฌ์šฉ

5. Encoder-Decoder Architecture

  • Encoder: 6๊ฐœ layer (Multi-Head Attention โ†’ FFN)
  • Decoder: 6๊ฐœ layer (Masked Multi-Head Attention โ†’ Encoder-Decoder Attention โ†’ FFN)
  • Residual Connection๊ณผ Layer Normalization ์ ์šฉ

๊ตฌํ˜„ ํŒŒ์ผ

  • ์ฝ”๋“œ: 2. Transformer_๊ตฌํ˜„.ipynb
  • ์ „์ฒด Transformer ๋ชจ๋ธ์„ PyTorch๋กœ ๊ตฌํ˜„
  • ๊ฐ ์ปดํฌ๋„ŒํŠธ๋ณ„ ์ƒ์„ธ ์„ค๋ช…๊ณผ ์‹œ๊ฐํ™” ํฌํ•จ

๐Ÿ“ ํ”„๋กœ์ ํŠธ ๊ตฌ์กฐ

Transformer/
โ”œโ”€โ”€ 1.Attention_Is_All_You_Need.md    # ๋…ผ๋ฌธ ์š”์•ฝ ๋ฐ ํ•ต์‹ฌ ๊ฐœ๋… ์„ค๋ช…
โ”œโ”€โ”€ 2. Transformer_๊ตฌํ˜„.ipynb          # ์ „์ฒด ๋ชจ๋ธ ๊ตฌํ˜„ ์ฝ”๋“œ
โ”œโ”€โ”€ 3. translation/                    # ๋ฒˆ์—ญ ์‹คํ—˜ ๊ด€๋ จ ํŒŒ์ผ
โ”‚   โ”œโ”€โ”€ data/                         # ํ•™์Šต/๊ฒ€์ฆ ๋ฐ์ดํ„ฐ์…‹
โ”‚   โ”œโ”€โ”€ models/                       # ์ €์žฅ๋œ ๋ชจ๋ธ ์ฒดํฌํฌ์ธํŠธ
โ”‚   โ””โ”€โ”€ results/                      # ์‹คํ—˜ ๊ฒฐ๊ณผ ๋ฐ ๋กœ๊ทธ
โ”œโ”€โ”€ 4. transformer_applications.md     # Transformer ์‘์šฉ ์‚ฌ๋ก€
โ””โ”€โ”€ README.md                          # ํ”„๋กœ์ ํŠธ ์„ค๋ช…์„œ

๐Ÿ“Š ์‹คํ—˜ ๊ฒฐ๊ณผ

๋ฒˆ์—ญ ํƒœ์Šคํฌ (Machine Translation)

์‹คํ—˜ ์„ค์ •

  • ๋ฐ์ดํ„ฐ์…‹: WMT English-German / Multi30k
  • ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ:
    • d_model: 512
    • num_heads: 8
    • num_layers: 6
    • d_ff: 2048
    • dropout: 0.1
  • ํ•™์Šต ์„ค์ •:
    • Optimizer: Adam (ฮฒ1=0.9, ฮฒ2=0.98, ฮต=10^-9)
    • Learning Rate: Warmup + Decay
    • Batch Size: 32
    • Epochs: 20-50

์„ฑ๋Šฅ ์ง€ํ‘œ

๋ฉ”ํŠธ๋ฆญ ๊ฐ’ ์„ค๋ช…
BLEU Score ~27.3 ๋ฒˆ์—ญ ํ’ˆ์งˆ ํ‰๊ฐ€ ์ง€ํ‘œ
Training Loss 1.8 โ†’ 0.5 Epoch์— ๋”ฐ๋ผ ๊ฐ์†Œ
Validation Loss 2.1 โ†’ 0.8 ๊ณผ์ ํ•ฉ ์—†์ด ํ•™์Šต ์ง„ํ–‰
ํ•™์Šต ์‹œ๊ฐ„ ~2-3์‹œ๊ฐ„ GPU ๊ธฐ์ค€ (NVIDIA RTX 3080)

ํ•™์Šต ๊ณก์„ 

Training Loss:
Epoch 1:  Loss = 4.2
Epoch 5:  Loss = 2.1
Epoch 10: Loss = 1.3
Epoch 20: Loss = 0.7
Epoch 30: Loss = 0.5

Validation Loss:
Epoch 1:  Loss = 4.5
Epoch 5:  Loss = 2.8
Epoch 10: Loss = 1.7
Epoch 20: Loss = 1.0
Epoch 30: Loss = 0.8

๋ฒˆ์—ญ ์˜ˆ์‹œ

์˜์–ด โ†’ ๋…์ผ์–ด

Input:  "I love learning about artificial intelligence."
Output: "Ich liebe es, รผber kรผnstliche Intelligenz zu lernen."
Reference: "Ich liebe es, รผber kรผnstliche Intelligenz zu lernen."
BLEU: 0.89

์˜์–ด โ†’ ํ•œ๊ตญ์–ด

Input:  "The weather is beautiful today."
Output: "์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์•„๋ฆ„๋‹ต์Šต๋‹ˆ๋‹ค."
Reference: "์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์•„๋ฆ„๋‹ค์›Œ์š”."
BLEU: 0.72

Attention ์‹œ๊ฐํ™”

Self-Attention์˜ ํ•™์Šต ํŒจํ„ด์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค:

  • ๋ฌธ๋ฒ•์  ๊ด€๊ณ„: ์ฃผ์–ด-๋™์‚ฌ, ํ˜•์šฉ์‚ฌ-๋ช…์‚ฌ ๊ด€๊ณ„ ํฌ์ฐฉ
  • ์žฅ๊ฑฐ๋ฆฌ ์˜์กด์„ฑ: ๋ฌธ์žฅ ๋‚ด ๋ฉ€๋ฆฌ ๋–จ์–ด์ง„ ๋‹จ์–ด ๊ฐ„ ๊ด€๊ณ„ ํ•™์Šต
  • Multi-Head ํšจ๊ณผ: ๊ฐ head๊ฐ€ ๋‹ค๋ฅธ linguistic feature ํ•™์Šต

์‹คํ—˜ ๊ฒฐ๊ณผ ์ƒ์„ธ ๋‚ด์šฉ: 3. translation/


๐Ÿš€ ์„ค์น˜ ๋ฐ ์‹คํ–‰

์š”๊ตฌ์‚ฌํ•ญ

Python >= 3.8
PyTorch >= 1.9.0
numpy >= 1.19.0
matplotlib >= 3.3.0
jupyter >= 1.0.0

์„ค์น˜ ๋ฐฉ๋ฒ•

  1. ์ €์žฅ์†Œ ํด๋ก 
git clone https://github.com/RunnerWay-KDT/Transformer.git
cd Transformer
  1. ๊ฐ€์ƒํ™˜๊ฒฝ ์ƒ์„ฑ (๊ถŒ์žฅ)
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
  1. ์˜์กด์„ฑ ์„ค์น˜
pip install torch numpy matplotlib jupyter

์‹คํ–‰ ๋ฐฉ๋ฒ•

1. Jupyter Notebook์œผ๋กœ ์‹คํ–‰

jupyter notebook "2. Transformer_๊ตฌํ˜„.ipynb"

2. Python ์Šคํฌ๋ฆฝํŠธ๋กœ ์‹คํ–‰

# ๋ชจ๋ธ ์ž„ํฌํŠธ ๋ฐ ์ดˆ๊ธฐํ™”
from transformer import Transformer

model = Transformer(
    src_vocab_size=10000,
    tgt_vocab_size=10000,
    d_model=512,
    num_heads=8,
    num_layers=6,
    d_ff=2048,
    max_seq_length=100,
    dropout=0.1
)

# ํ•™์Šต
# (ํ•™์Šต ์ฝ”๋“œ๋Š” ๋…ธํŠธ๋ถ ์ฐธ์กฐ)

3. ๋ฒˆ์—ญ ์‹คํ—˜ ์‹คํ–‰

cd "3. translation"
python train.py --config config.yaml

๐Ÿ“š ์ฐธ๊ณ  ์ž๋ฃŒ

์›๋ณธ ๋…ผ๋ฌธ

  • Attention Is All You Need (2017)

๊ด€๋ จ ์ž๋ฃŒ

์ถ”๊ฐ€ ์‘์šฉ ์‚ฌ๋ก€

Transformer์˜ ๋‹ค์–‘ํ•œ ์‘์šฉ ๋ถ„์•ผ๋Š” 4. transformer_applications.md๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”:

  • BERT, GPT ๋“ฑ Pre-trained Language Models
  • Vision Transformer (ViT)
  • Speech Recognition
  • ๊ธฐํƒ€ Multi-modal Applications

๐Ÿ‘ฅ Authors

RunnerWay-KDT

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors