# lenguage

A collection of Natural Language Processing (NLP) tools and examples demonstrating various language processing techniques, including context-free grammars, n-gram analysis, Markov chain text generation, sentiment analysis, and word vector operations.

## Features

- Context-Free Grammar (CFG) Parsing: Parse sentences using custom grammar rules and visualize parse trees
- N-gram Analysis: Extract and analyze n-gram frequencies from text corpora
- Markov Chain Text Generation: Generate new text based on training data using Markov chains
- Sentiment Analysis: Classify text as positive or negative using Naive Bayes classification
- Word Vectors: Work with word embeddings and find semantically similar words using cosine distance
## Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

## Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd lenguage
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the required NLTK data (for tokenization):

  ```python
  import nltk
  nltk.download('punkt')
  ```
## Usage

### Context-Free Grammar Parsing

Parse sentences using predefined grammar rules:

```bash
# Basic grammar (simple sentences)
python cfg/cfg0.py
# Enter: she saw the city

# Extended grammar (with adjectives and prepositions)
python cfg/cfg1.py
# Enter: the big blue dog saw a car
```

The parser displays a visual parse tree for valid sentences.
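The CFG scripts are presumably built on NLTK's grammar utilities, but the core recognition idea can be sketched in plain Python. The grammar and vocabulary below are illustrative stand-ins, not the actual rules in `cfg0.py`:

```python
# Hypothetical miniature grammar in the spirit of cfg0.py (rules are illustrative).
# Keys are nonterminals; anything that is not a key is treated as a terminal word.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"]],
}

def parse(symbol, tokens, start):
    """Yield every index where `symbol` can finish matching tokens[start:]."""
    if symbol not in GRAMMAR:                      # terminal word
        if start < len(tokens) and tokens[start] == symbol:
            yield start + 1
        return
    for production in GRAMMAR[symbol]:
        ends = {start}
        for part in production:                    # match each part in sequence
            ends = {e2 for e in ends for e2 in parse(part, tokens, e)}
        yield from ends

def accepts(sentence):
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)

print(accepts("she saw the city"))
```

A real parser (NLTK's `ChartParser`, for instance) also records *which* rules matched so it can build the parse tree, but the accept/reject logic is the same recursive expansion of productions.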
### N-gram Analysis

Analyze n-gram frequencies in a corpus of documents:

```bash
python ngrams/ngrams.py <n> <corpus_directory>

# Example: find the 10 most common bigrams in the Sherlock Holmes stories
python ngrams/ngrams.py 2 ngrams/holmes

# Example: find the 10 most common trigrams
python ngrams/ngrams.py 3 ngrams/holmes
```
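Conceptually, the counting step reduces to sliding an n-word window over the token stream. A minimal stand-in (using naive whitespace tokenization, whereas `ngrams.py` presumably uses NLTK's tokenizer over every file in the corpus directory):

```python
from collections import Counter

def ngram_counts(text, n):
    # Lowercase and split on whitespace, then count each n-word window
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

counts = ngram_counts("to be or not to be", 2)
print(counts.most_common(1))  # [(('to', 'be'), 2)]
```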
### Markov Chain Text Generation

Generate new text based on a training corpus:

```bash
python markov/generator.py <text_file>

# Example: generate Shakespeare-like text
python markov/generator.py markov/shakespeare.txt
```

This outputs five randomly generated sentences based on the training text.
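Given the `markovify` dependency, `generator.py` most likely delegates to that library; the underlying idea is a transition table from each state (the previous word or words) to the words observed to follow it. A from-scratch sketch:

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each `order`-word state to the list of words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, order=1, max_words=15):
    """Start from a random state and repeatedly sample a follower."""
    state = random.choice(sorted(chain))
    out = list(state)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-order:]))
        if not followers:          # dead end: no observed continuation
            break
        out.append(random.choice(followers))
    return " ".join(out)

chain = build_chain("the cat sat on the mat and the cat slept")
print(generate(chain))
```

Because followers are stored with repetition, frequent transitions are sampled proportionally more often, which is what makes the output resemble the training text.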
### Sentiment Analysis

Classify text as positive or negative:

```bash
python sentiment/sentiment.py <corpus_directory>

# Example: using the provided corpus
python sentiment/sentiment.py sentiment/corpus
# Enter a sentence when prompted to classify it
```
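The Naive Bayes scoring step amounts to multiplying per-word likelihoods for each class (in log space, with Laplace smoothing for unseen words). The tiny training lists here are illustrative stand-ins for `positives.txt` and `negatives.txt`, and equal class priors are assumed:

```python
import math
from collections import Counter

# Toy training data standing in for positives.txt / negatives.txt
POSITIVE = ["great movie", "loved it", "great fun"]
NEGATIVE = ["terrible movie", "hated it", "boring plot"]

def train(docs):
    words = Counter(w for d in docs for w in d.split())
    return words, sum(words.values())

def log_prob(sentence, words, total, vocab_size):
    # Laplace (+1) smoothing so an unseen word doesn't zero out the product
    return sum(math.log((words[w] + 1) / (total + vocab_size))
               for w in sentence.split())

def classify(sentence):
    vocab = {w for d in POSITIVE + NEGATIVE for w in d.split()}
    # Priors are equal here (3 docs per class), so they cancel out
    pos = log_prob(sentence, *train(POSITIVE), len(vocab))
    neg = log_prob(sentence, *train(NEGATIVE), len(vocab))
    return "positive" if pos > neg else "negative"

print(classify("great fun"))
```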
### Word Vectors

Work with word embeddings in Python:

```python
import os
os.chdir('vectors')
from vectors import words, closest_words, distance

# Find the words closest to a given word
closest_words(words["king"])

# Calculate the semantic distance between two words
distance(words["king"], words["queen"])

# Word arithmetic (e.g., king - man + woman ≈ queen)
result = words["king"] - words["man"] + words["woman"]
closest_words(result)
```
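Given the `scipy` dependency, `distance` is presumably cosine distance (e.g. `scipy.spatial.distance.cosine`). The computation itself, shown with made-up three-dimensional vectors rather than the high-dimensional embeddings in `words.txt`:

```python
import math

# Toy 3-dimensional "embeddings"; purely illustrative values
words = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def distance(a, b):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

# Semantically related words point in similar directions, so their distance is smaller
assert distance(words["king"], words["queen"]) < distance(words["king"], words["apple"])
```

`closest_words` is then just a matter of sorting the whole vocabulary by this distance from the query vector.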
## Project Structure

```
lenguage/
├── cfg/                  # Context-Free Grammar examples
│   ├── cfg0.py           # Basic grammar parser
│   └── cfg1.py           # Extended grammar with adjectives/prepositions
├── markov/               # Markov chain text generation
│   ├── generator.py      # Text generator script
│   └── shakespeare.txt   # Sample training corpus (Shakespeare works)
├── ngrams/               # N-gram analysis
│   ├── ngrams.py         # N-gram frequency analyzer
│   └── holmes/           # Sample corpus (Sherlock Holmes stories)
├── sentiment/            # Sentiment analysis
│   ├── sentiment.py      # Naive Bayes sentiment classifier
│   └── corpus/           # Training data (positives.txt, negatives.txt)
├── vectors/              # Word vector operations
│   ├── vectors.py        # Word embedding utilities
│   └── words.txt         # Pre-trained word vectors
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
## Dependencies

| Package | Version |
|---|---|
| nltk | 3.9.2 |
| markovify | 0.9.4 |
| scipy | 1.16.3 |
| numpy | 2.4.0 |
## Contributing

Contributions are welcome! To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
## License

This project is provided for educational purposes. The Shakespeare text is from Project Gutenberg, and the Sherlock Holmes stories are in the public domain.