This project implements a GPT (Generative Pre-trained Transformer) model from scratch. The endeavor is driven by the desire to understand the intricacies of transformer architectures, natural language processing, and pre-training mechanisms.
The main focus is to build a basic GPT model for educational purposes, capable of generating text from a provided knowledge base (text files).
When training a generative model, one must define a task for the model to train on. Depending on the model's purpose, different tasks can be chosen to capture different aspects of the semantic or probabilistic relationships between a sequence's elements. Language Modeling (LM) is a self-supervised task in which the model predicts the next element of a sequence based on the previous ones.
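For instance, an n-gram language model estimates the probability of the next element from counts of observed contexts. The sketch below illustrates this idea; the function and variable names are illustrative, not taken from this repository.

```python
# Minimal sketch of the LM task: predict the next token from the
# preceding context using observed counts. Names here are illustrative.
from collections import Counter, defaultdict

def next_token_distribution(tokens, context_size=2):
    """Count how often each token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - context_size):
        context = tuple(tokens[i:i + context_size])
        counts[context][tokens[i + context_size]] += 1
    return counts

tokens = "the cat sat on the mat the cat ran".split()
dist = next_token_distribution(tokens)
# P(next | "the cat") is proportional to the observed counts:
print(dist[("the", "cat")])  # Counter({'sat': 1, 'ran': 1})
```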
The unidirectional LM model in this repository is built on an n-gram idea (originally trigram-based), and the process of populating the empty n-gram model was enhanced with a MapReduce algorithm. The model itself has a simple encoder and decoder and supports greedy search, temperature scaling, and two sampling techniques (top-k sampling and top-p sampling), plus random sampling.
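The decoding strategies listed above can be summarized over a toy distribution as in the following sketch; the function names are assumptions for illustration, not the repository's actual API.

```python
# Hedged sketch of the decoding strategies mentioned above: greedy search,
# temperature scaling, top-k sampling, and top-p sampling. All names are
# illustrative and may differ from the repository's code.
import math
import random

def apply_temperature(probs, temperature=1.0):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {w: p / total for w, p in scaled.items()}

def greedy(probs):
    """Pick the single most likely next token."""
    return max(probs, key=probs.get)

def top_k_sample(probs, k=5):
    """Sample only among the k most likely tokens."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights)[0]

def top_p_sample(probs, p=0.9):
    """Sample from the smallest set of tokens whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob
        if cumulative >= p:
            break
    words, weights = zip(*kept)
    return random.choices(words, weights=weights)[0]

probs = {"sat": 0.5, "ran": 0.3, "slept": 0.2}
print(greedy(probs))                                     # always "sat"
print(top_k_sample(apply_temperature(probs, 0.7), k=2))  # "sat" or "ran"
```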
The input message length is restricted by the PROMPT_LIMIT variable; it is best to keep this value low when a relatively small dataset is provided for training.
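A simple guard along these lines could enforce the limit; the check below is a sketch, and only the PROMPT_LIMIT name comes from this project (its value here is assumed).

```python
# Illustrative prompt-length guard. PROMPT_LIMIT is defined in the
# repository's utils/constants.py; the value and the function are assumptions.
PROMPT_LIMIT = 10  # assumed value for the example

def validate_prompt(prompt: str) -> str:
    """Reject prompts longer than PROMPT_LIMIT tokens."""
    if len(prompt.split()) > PROMPT_LIMIT:
        raise ValueError(f"Prompt exceeds PROMPT_LIMIT ({PROMPT_LIMIT} tokens)")
    return prompt
```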
In addition, a typewriter function was introduced to simulate gradual text generation.
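Such an effect can be as simple as flushing one character at a time with a short delay; the sketch below is one possible implementation, not necessarily the repository's.

```python
# One possible typewriter effect: print characters one by one with a delay.
import sys
import time

def typewriter(text: str, delay: float = 0.05) -> None:
    """Print text one character at a time to simulate gradual generation."""
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(delay)
    print()

typewriter("Generated text appears gradually...")
```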
Folder descriptions:
inputs - txt input files used to generate the initial corpus
outputs - contains corpus.txt, a single text combining all input files
models - models saved after training
utils - contains constants.py, a file with all constants and folder names
media - demo videos (presentations)


