README generated by GPT
A repository exploring GPT and large language models (LLMs), covering fundamental concepts, training methods, and practical applications. It documents an educational journey through understanding and implementing GPT and related language models.
- Pre-training: Training on large-scale datasets to learn general language patterns.
- Fine-tuning: Further training on specific datasets tailored to certain tasks or domains.
- Introduced in "Attention Is All You Need" (Vaswani et al., 2017).
- Components:
- Encoder: Encodes input text into contextual vectors.
- Decoder: Generates output text from encoded vectors.
- Self-attention: Captures dependencies between words in text.
- Extensions:
- BERT: Encoder-based, optimal for classification tasks.
- GPT: Decoder-based, ideal for generating text predictions.
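The self-attention idea above can be sketched in plain Python as toy scaled dot-product attention (a simplification for illustration: the raw embeddings serve as queries, keys, and values, with no learned projection matrices):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    # Toy scaled dot-product self-attention: every token attends to
    # every token; each output is an attention-weighted mix of values.
    d = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)
        context = [sum(w * value[j] for w, value in zip(weights, embeddings))
                   for j in range(d)]
        outputs.append(context)
    return outputs
```

A real transformer learns separate query/key/value projections and applies a causal mask in the decoder; this sketch only shows the attention-weighting mechanic.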
- Initially introduced in "Improving Language Understanding by Generative Pre-Training".
- Scaled up with GPT-3, which grew to 96 layers and 175 billion parameters.
- Demonstrated emergent behaviors like effective translation without explicit training.
- GPT-3 was trained on a mix of large datasets; a single training run is estimated to have cost approximately $4.6 million.
- Relevant datasets:
Blueprint:
Stage 1: Data & Architecture
- Data Preparation
- Attention Mechanism
- Architecture Selection
Stage 2: Pre-training
- Training Loop
- Evaluation
- Load Pretrained Weights
Stage 3: Fine-tuning
- Classifier or Personal Assistant Training
- nvim: launch from WSL with `wsl nvim <filename>`; needs PyTorch/autocompletion setup.
- TorchStudio: User-friendly, may require further tutorials.
Embeddings:
- Convert unstructured data into continuous vectors.
- Types: Word embeddings (Word2Vec), paragraph embeddings.
- Dimensionality: GPT-3 uses 12,288-dimensional embeddings.
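An embedding layer is essentially a lookup table from token ids to vectors. A minimal sketch (randomly initialized here, whereas a real model learns these vectors during training; the function names are illustrative):

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    # One dim-dimensional vector per token id, randomly initialized.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    # Embedding lookup: map each token id to its vector.
    return [table[t] for t in token_ids]
```

In PyTorch this lookup table corresponds to `nn.Embedding(vocab_size, dim)`.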
Simple Regex Tokenizer:
    import re
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
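Applied to a sample sentence (here `raw_text` stands for any input string), the capturing group keeps the punctuation as separate tokens, and filtering drops empty strings and whitespace:

```python
import re

raw_text = "Hello, world. Is this-- a test?"
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item.strip() for item in preprocessed if item.strip()]
# tokens == ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```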
Simple Vocabulary Builder:
- Maps words to unique integers for token encoding.
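A minimal sketch of such a vocabulary builder, reusing the regex split from above (the function name is illustrative):

```python
import re

def build_vocab(raw_text):
    # Tokenize, deduplicate, sort, and assign each token a unique integer id.
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return {token: i for i, token in enumerate(sorted(set(tokens)))}
```

Encoding is then `[vocab[t] for t in tokens]`, and decoding inverts the mapping with `{i: t for t, i in vocab.items()}`.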
Contextual Tokens:
- Special tokens (`<|endoftext|>`, `<unk>`, `[PAD]`) enhance model capability by marking document boundaries, unknown words, and padding.
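One way to sketch this: extend an existing vocabulary with special tokens, and fall back to the `<unk>` id for out-of-vocabulary words (helper names are illustrative):

```python
def add_special_tokens(vocab, specials=("<|endoftext|>", "<unk>", "[PAD]")):
    # Append special tokens after the existing token ids.
    extended = dict(vocab)
    for token in specials:
        if token not in extended:
            extended[token] = len(extended)
    return extended

def encode_with_unk(tokens, vocab):
    # Map unknown tokens to the <unk> id instead of raising KeyError.
    unk_id = vocab["<unk>"]
    return [vocab.get(t, unk_id) for t in tokens]
```

Note that GPT's BPE tokenizer only needs `<|endoftext|>`, since byte-pair encoding can represent any string without an unknown-token fallback.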
Byte-Pair Encoding (BPE):
- Used by GPT models for efficient tokenization.
- Python implementation:
    import tiktoken
    tokenizer = tiktoken.get_encoding("gpt2")
    tokenizer.encode("Example text")
Create input-target pairs:
    tokenizer = tiktoken.get_encoding("gpt2")
    with open("the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()
    encoded_text = tokenizer.encode(raw_text)
PyTorch efficient data handling:
    class GPTDataset(Dataset):
        def __init__(self, txt, tokenizer, max_length, stride):
            # Initialization logic: tokenize txt and build input/target
            # windows of max_length tokens, advancing by stride.
            ...
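The sliding-window logic inside such a Dataset can be sketched without PyTorch (`max_length` is the context size, `stride` the window step; parameter names follow the stub above, the function name is illustrative):

```python
def create_input_target_pairs(token_ids, max_length, stride):
    # Slide a window over the token sequence; each target is its input
    # shifted one position to the right (next-token prediction).
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i : i + max_length])
        targets.append(token_ids[i + 1 : i + max_length + 1])
    return inputs, targets
```

With `stride == max_length` the windows do not overlap; a smaller stride yields more (overlapping) training examples from the same text.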
- Explore BPE in depth (Hugging Face Course).
- Experiment with data sampling parameters for optimal performance.