This project implements a GPT (Generative Pre-trained Transformer) model from scratch. The endeavor is driven by the desire to understand the intricacies of transformer architectures, natural language processing, and pre-training mechanisms.
The main focus is to build a basic GPT model for educational purposes, capable of generating text from a provided knowledge base (text files).
When training a generative model, one must define a task for the model to train on. Depending on the model's purpose, different tasks can be chosen to capture different aspects of the semantic or probabilistic relationships between a sequence's elements. Language Modeling (LM) is a self-supervised task in which the model predicts the next element of a sequence based on the previous ones.
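For instance, an n-gram language model estimates the probability of the next element from counts of observed contexts. The sketch below illustrates this idea; the function and variable names are illustrative, not taken from this repository.

```python
# Minimal sketch of the LM task: predict the next token from the
# preceding context using observed counts. Names here are illustrative.
from collections import Counter, defaultdict

def next_token_distribution(tokens, context_size=2):
    """Count how often each token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    return counts

tokens = "the cat sat on the mat the cat ran".split()
dist = next_token_distribution(tokens)
# P(next | "the cat") is proportional to the observed counts:
print(dist[("the", "cat")])  # Counter({'sat': 1, 'ran': 1})
```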
The unidirectional LM model in this repository is built on an n-gram idea (originally trigram-based), and the process of populating the empty n-gram model was enhanced with a MapReduce algorithm. The model itself has a simple encoder and decoder and supports greedy search, temperature scaling, and two sampling techniques (top-k sampling and top-p sampling), plus random sampling.
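The decoding strategies listed above can be summarized over a toy distribution as in the following sketch; the function names are assumptions for illustration, not the repository's actual API.

```python
# Hedged sketch of the decoding strategies mentioned above: greedy search,
# temperature scaling, top-k sampling, and top-p sampling. All names are
# illustrative and may differ from the repository's code.
import math
import random

def apply_temperature(probs, temperature=1.0):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}

def greedy(probs):
    """Pick the single most likely next token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k=5):
    """Sample only among the k most likely tokens."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights)[0]

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    words, weights = zip(*kept)
    return random.choices(words, weights=weights)[0]

probs = {"sat": 0.5, "ran": 0.3, "slept": 0.2}
print(greedy(probs))                                     # always "sat"
print(top_k_sample(apply_temperature(probs, 0.7), k=2))  # "sat" or "ran"
```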
The input message length is restricted by the PROMPT_LIMIT variable; it is best to keep this value low when a relatively small dataset is provided for training.
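A simple guard along these lines could enforce the limit; the check below is a sketch, and only the PROMPT_LIMIT name comes from this project (its value here is assumed).

```python
# Illustrative prompt-length guard. PROMPT_LIMIT is defined in the
# repository's utils/constants.py; the value and the function are assumptions.
PROMPT_LIMIT = 10  # assumed value for the example

def validate_prompt(prompt: str) -> str:
    """Reject prompts longer than PROMPT_LIMIT tokens."""
    if len(prompt.split()) > PROMPT_LIMIT:
        raise ValueError(f"Prompt exceeds PROMPT_LIMIT ({PROMPT_LIMIT} tokens)")
    return prompt
```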
In addition, a typewriter function was introduced to simulate gradual text generation.
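Such an effect can be as simple as flushing one character at a time with a short delay; the sketch below is one possible implementation, not necessarily the repository's.

```python
# One possible typewriter effect: print characters one by one with a delay.
import sys
import time

def typewriter(text: str, delay: float = 0.05) -> None:
    """Print text one character at a time to simulate gradual generation."""
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(delay)
    print()

typewriter("Generated text appears gradually...")
```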
Folder descriptions:
inputs - txt input files used to generate the initial corpus
outputs - contains corpus.txt, a single text combining all input files
models - models saved after training
utils - contains constants.py, a file with all constants and folder names
media - demo videos (presentations)


