README generated by GPT
A repository exploring GPT and large language models (LLMs), covering fundamental concepts, training methods, and practical applications. It documents an educational journey through understanding and implementing GPT and related language models.
- Pre-training: Training on large-scale datasets to learn general language patterns.
- Fine-tuning: Further training on specific datasets tailored to certain tasks or domains.
- Introduced in "Attention Is All You Need" (Vaswani et al., 2017).
- Components:
- Encoder: Encodes input text into contextual vectors.
- Decoder: Generates output text from encoded vectors.
- Self-attention: Captures dependencies between words in text.
- Extensions:
- BERT: Encoder-based, optimal for classification tasks.
- GPT: Decoder-based, ideal for generating text predictions.
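The self-attention idea above can be sketched in plain Python as toy scaled dot-product attention (a simplification for illustration: the raw embeddings serve as queries, keys, and values, with no learned projection matrices):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    # Toy scaled dot-product self-attention: every token attends to
    # every token; each output is an attention-weighted mix of values.
    d = len(embeddings[0])
    outputs = []
    for query in embeddings:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        weights = softmax(scores)
        context = [sum(w * value[j] for w, value in zip(weights, embeddings))
                   for j in range(d)]
        outputs.append(context)
    return outputs
```

A real transformer learns separate query/key/value projections and applies a causal mask in the decoder; this sketch only shows the attention-weighting mechanic.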
- Initially introduced in "Improving Language Understanding by Generative Pre-Training".
- Scaled up with GPT-3, which grew to 96 layers and 175 billion parameters.
- Demonstrated emergent behaviors like effective translation without explicit training.
- GPT-3 was trained on a mix of large datasets; a single training run is estimated to have cost approximately $4.6 million.
- Relevant datasets:
Blueprint:
Stage 1: Data & Architecture
- Data Preparation
- Attention Mechanism
- Architecture Selection
Stage 2: Pre-training
- Training Loop
- Evaluation
- Load Pretrained Weights
Stage 3: Fine-tuning
- Classifier or Personal Assistant Training
- nvim: launch from WSL with `wsl nvim <filename>`; needs PyTorch/autocompletion setup.
- TorchStudio: User-friendly, may require further tutorials.
Embeddings:
- Convert unstructured data into continuous vectors.
- Types: Word embeddings (Word2Vec), paragraph embeddings.
- Dimensionality: GPT-3 uses 12,288-dimensional embeddings.
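An embedding layer is essentially a lookup table from token ids to vectors. A minimal sketch (randomly initialized here, whereas a real model learns these vectors during training; the function names are illustrative):

```python
import random

def make_embedding_table(vocab_size, dim, seed=0):
    # One dim-dimensional vector per token id, randomly initialized.
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    # Embedding lookup: map each token id to its vector.
    return [table[t] for t in token_ids]
```

In PyTorch this lookup table corresponds to `nn.Embedding(vocab_size, dim)`.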
Simple Regex Tokenizer:
    import re
    preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
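Applied to a sample sentence (here `raw_text` stands for any input string), the capturing group keeps the punctuation as separate tokens, and filtering drops empty strings and whitespace:

```python
import re

raw_text = "Hello, world. Is this-- a test?"
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
tokens = [item.strip() for item in preprocessed if item.strip()]
# tokens == ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```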
Simple Vocabulary Builder:
- Maps words to unique integers for token encoding.
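A minimal sketch of such a vocabulary builder, reusing the regex split from above (the function name is illustrative):

```python
import re

def build_vocab(raw_text):
    # Tokenize, deduplicate, sort, and assign each token a unique integer id.
    tokens = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
    tokens = [t.strip() for t in tokens if t.strip()]
    return {token: i for i, token in enumerate(sorted(set(tokens)))}
```

Encoding is then `[vocab[t] for t in tokens]`, and decoding inverts the mapping with `{i: t for t, i in vocab.items()}`.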
Contextual Tokens:
- Special tokens (`<|endoftext|>`, `<unk>`, `[PAD]`) enhance model capability by marking document boundaries, unknown words, and padding.
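One way to sketch this: extend an existing vocabulary with special tokens, and fall back to the `<unk>` id for out-of-vocabulary words (helper names are illustrative):

```python
def add_special_tokens(vocab, specials=("<|endoftext|>", "<unk>", "[PAD]")):
    # Append special tokens after the existing token ids.
    extended = dict(vocab)
    for token in specials:
        if token not in extended:
            extended[token] = len(extended)
    return extended

def encode_with_unk(tokens, vocab):
    # Map unknown tokens to the <unk> id instead of raising KeyError.
    unk_id = vocab["<unk>"]
    return [vocab.get(t, unk_id) for t in tokens]
```

Note that GPT's BPE tokenizer only needs `<|endoftext|>`, since byte-pair encoding can represent any string without an unknown-token fallback.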
Byte-Pair Encoding (BPE):
- Used by GPT models for efficient tokenization.
- Python implementation:
    import tiktoken
    tokenizer = tiktoken.get_encoding("gpt2")
    tokenizer.encode("Example text")
Create input-target pairs:
    tokenizer = tiktoken.get_encoding("gpt2")
    with open("the-verdict.txt", "r", encoding="utf-8") as f:
        raw_text = f.read()
    encoded_text = tokenizer.encode(raw_text)
PyTorch efficient data handling:
    class GPTDataset(Dataset):
        def __init__(self, txt, tokenizer, max_length, stride):
            # Initialization logic: tokenize txt and build input/target
            # windows of max_length tokens, advancing by stride.
            ...
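The sliding-window logic inside such a Dataset can be sketched without PyTorch (`max_length` is the context size, `stride` the window step; parameter names follow the stub above, the function name is illustrative):

```python
def create_input_target_pairs(token_ids, max_length, stride):
    # Slide a window over the token sequence; each target is its input
    # shifted one position to the right (next-token prediction).
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i : i + max_length])
        targets.append(token_ids[i + 1 : i + max_length + 1])
    return inputs, targets
```

With `stride == max_length` the windows do not overlap; a smaller stride yields more (overlapping) training examples from the same text.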
- Explore BPE in depth (Hugging Face Course).
- Experiment with data sampling parameters for optimal performance.