# lenguage

A collection of Natural Language Processing (NLP) tools and examples demonstrating various language processing techniques, including context-free grammars, n-gram analysis, Markov chain text generation, sentiment analysis, and word vector operations.

## Features

- Context-Free Grammar (CFG) Parsing: Parse sentences using custom grammar rules and visualize parse trees
- N-gram Analysis: Extract and analyze n-gram frequencies from text corpora
- Markov Chain Text Generation: Generate new text based on training data using Markov chains
- Sentiment Analysis: Classify text as positive or negative using Naive Bayes classification
- Word Vectors: Work with word embeddings and find semantically similar words using cosine distance
## Prerequisites

- Python 3.8 or higher
- pip (Python package manager)

## Installation

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd lenguage
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the required NLTK data (for tokenization):

  ```python
  import nltk
  nltk.download('punkt')
  ```
## Usage

### Context-Free Grammar Parsing

Parse sentences using predefined grammar rules:

```bash
# Basic grammar (simple sentences)
python cfg/cfg0.py
# Enter: she saw the city

# Extended grammar (with adjectives and prepositions)
python cfg/cfg1.py
# Enter: the big blue dog saw a car
```

The parser displays a visual parse tree for valid sentences.
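The CFG scripts are presumably built on NLTK's grammar utilities, but the core recognition idea can be sketched in plain Python. The grammar and vocabulary below are illustrative stand-ins, not the actual rules in `cfg0.py`:

```python
# Hypothetical miniature grammar in the spirit of cfg0.py (rules are illustrative).
# Keys are nonterminals; anything that is not a key is treated as a terminal word.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
    "D":  [["the"], ["a"]],
    "N":  [["she"], ["city"], ["car"]],
    "V":  [["saw"]],
}

def parse(symbol, tokens, start):
    """Yield every index where `symbol` can finish matching tokens[start:]."""
    if symbol not in GRAMMAR:                      # terminal word
        if start < len(tokens) and tokens[start] == symbol:
            yield start + 1
        return
    for production in GRAMMAR[symbol]:
        ends = {start}
        for part in production:                    # match each part in sequence
            ends = {e2 for e in ends for e2 in parse(part, tokens, e)}
        yield from ends

def accepts(sentence):
    tokens = sentence.split()
    return len(tokens) in parse("S", tokens, 0)

print(accepts("she saw the city"))
```

A real parser (NLTK's `ChartParser`, for instance) also records *which* rules matched so it can build the parse tree, but the accept/reject logic is the same recursive expansion of productions.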
### N-gram Analysis

Analyze n-gram frequencies in a corpus of documents:

```bash
python ngrams/ngrams.py <n> <corpus_directory>

# Example: find the 10 most common bigrams in the Sherlock Holmes stories
python ngrams/ngrams.py 2 ngrams/holmes

# Example: find the 10 most common trigrams
python ngrams/ngrams.py 3 ngrams/holmes
```
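Conceptually, the counting step reduces to sliding an n-word window over the token stream. A minimal stand-in (using naive whitespace tokenization, whereas `ngrams.py` presumably uses NLTK's tokenizer over every file in the corpus directory):

```python
from collections import Counter

def ngram_counts(text, n):
    # Lowercase and split on whitespace, then count each n-word window
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

counts = ngram_counts("to be or not to be", 2)
print(counts.most_common(1))  # [(('to', 'be'), 2)]
```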
### Markov Chain Text Generation

Generate new text based on a training corpus:

```bash
python markov/generator.py <text_file>

# Example: generate Shakespeare-like text
python markov/generator.py markov/shakespeare.txt
```

This outputs five randomly generated sentences based on the training text.
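Given the `markovify` dependency, `generator.py` most likely delegates to that library; the underlying idea is a transition table from each state (the previous word or words) to the words observed to follow it. A from-scratch sketch:

```python
import random
from collections import defaultdict

def build_chain(text, order=1):
    """Map each `order`-word state to the list of words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def generate(chain, order=1, max_words=15):
    """Start from a random state and repeatedly sample a follower."""
    state = random.choice(sorted(chain))
    out = list(state)
    while len(out) < max_words:
        followers = chain.get(tuple(out[-order:]))
        if not followers:          # dead end: no observed continuation
            break
        out.append(random.choice(followers))
    return " ".join(out)

chain = build_chain("the cat sat on the mat and the cat slept")
print(generate(chain))
```

Because followers are stored with repetition, frequent transitions are sampled proportionally more often, which is what makes the output resemble the training text.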
### Sentiment Analysis

Classify text as positive or negative:

```bash
python sentiment/sentiment.py <corpus_directory>

# Example: using the provided corpus
python sentiment/sentiment.py sentiment/corpus
# Enter a sentence when prompted to classify it
```
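The Naive Bayes scoring step amounts to multiplying per-word likelihoods for each class (in log space, with Laplace smoothing for unseen words). The tiny training lists here are illustrative stand-ins for `positives.txt` and `negatives.txt`, and equal class priors are assumed:

```python
import math
from collections import Counter

# Toy training data standing in for positives.txt / negatives.txt
POSITIVE = ["great movie", "loved it", "great fun"]
NEGATIVE = ["terrible movie", "hated it", "boring plot"]

def train(docs):
    words = Counter(w for d in docs for w in d.split())
    return words, sum(words.values())

def log_prob(sentence, words, total, vocab_size):
    # Laplace (+1) smoothing so an unseen word doesn't zero out the product
    return sum(math.log((words[w] + 1) / (total + vocab_size))
               for w in sentence.split())

def classify(sentence):
    vocab = {w for d in POSITIVE + NEGATIVE for w in d.split()}
    # Priors are equal here (3 docs per class), so they cancel out
    pos = log_prob(sentence, *train(POSITIVE), len(vocab))
    neg = log_prob(sentence, *train(NEGATIVE), len(vocab))
    return "positive" if pos > neg else "negative"

print(classify("great fun"))
```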
### Word Vectors

Work with word embeddings in Python:

```python
import os
os.chdir('vectors')
from vectors import words, closest_words, distance

# Find the words closest to a given word
closest_words(words["king"])

# Calculate the semantic distance between two words
distance(words["king"], words["queen"])

# Word arithmetic (e.g., king - man + woman ≈ queen)
result = words["king"] - words["man"] + words["woman"]
closest_words(result)
```
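Given the `scipy` dependency, `distance` is presumably cosine distance (e.g. `scipy.spatial.distance.cosine`). The computation itself, shown with made-up three-dimensional vectors rather than the high-dimensional embeddings in `words.txt`:

```python
import math

# Toy 3-dimensional "embeddings"; purely illustrative values
words = {
    "king":  [0.8, 0.6, 0.1],
    "queen": [0.7, 0.7, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

def distance(a, b):
    """Cosine distance: 0 for identical directions, up to 2 for opposite ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

# Semantically related words point in similar directions, so their distance is smaller
assert distance(words["king"], words["queen"]) < distance(words["king"], words["apple"])
```

`closest_words` is then just a matter of sorting the whole vocabulary by this distance from the query vector.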
## Project Structure

```
lenguage/
├── cfg/                  # Context-Free Grammar examples
│   ├── cfg0.py           # Basic grammar parser
│   └── cfg1.py           # Extended grammar with adjectives/prepositions
├── markov/               # Markov chain text generation
│   ├── generator.py      # Text generator script
│   └── shakespeare.txt   # Sample training corpus (Shakespeare works)
├── ngrams/               # N-gram analysis
│   ├── ngrams.py         # N-gram frequency analyzer
│   └── holmes/           # Sample corpus (Sherlock Holmes stories)
├── sentiment/            # Sentiment analysis
│   ├── sentiment.py      # Naive Bayes sentiment classifier
│   └── corpus/           # Training data (positives.txt, negatives.txt)
├── vectors/              # Word vector operations
│   ├── vectors.py        # Word embedding utilities
│   └── words.txt         # Pre-trained word vectors
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
## Dependencies

| Package | Version |
|---|---|
| nltk | 3.9.2 |
| markovify | 0.9.4 |
| scipy | 1.16.3 |
| numpy | 2.4.0 |
## Contributing

Contributions are welcome! To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/new-feature`)
- Commit your changes (`git commit -am 'Add new feature'`)
- Push to the branch (`git push origin feature/new-feature`)
- Open a Pull Request
## License

This project is provided for educational purposes. The Shakespeare text is from Project Gutenberg, and the Sherlock Holmes stories are in the public domain.