ealbertoav/lenguage


Language Processing Toolkit

A collection of Natural Language Processing (NLP) tools and examples demonstrating various language processing techniques including context-free grammars, n-gram analysis, Markov chain text generation, sentiment analysis, and word vector operations.

Features

  • Context-Free Grammar (CFG) Parsing: Parse sentences using custom grammar rules and visualize parse trees
  • N-gram Analysis: Extract and analyze n-gram frequencies from text corpora
  • Markov Chain Text Generation: Generate new text based on training data using Markov chains
  • Sentiment Analysis: Classify text as positive or negative using Naive Bayes classification
  • Word Vectors: Work with word embeddings and find semantically similar words using cosine distance

Installation

Prerequisites

  • Python 3.11 or higher (the NumPy and SciPy versions pinned under Requirements do not support older interpreters)
  • pip (Python package manager)

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd lenguage
  2. Install dependencies:

    pip install -r requirements.txt
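  3. Download the NLTK tokenizer data (run these lines in a Python interpreter, not the shell):

    import nltk
    nltk.download('punkt_tab')  # NLTK 3.9 tokenizers load 'punkt_tab' rather than the older 'punkt'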

Usage

Context-Free Grammar Parsing

Parse sentences using predefined grammar rules:

# Basic grammar (simple sentences)
python cfg/cfg0.py
# Enter: she saw the city

# Extended grammar (with adjectives and prepositions)
python cfg/cfg1.py
# Enter: the big blue dog saw a car

The parser will display a visual parse tree for valid sentences.
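
To show what parsing against grammar rules involves, here is a minimal sketch of a top-down parser in plain Python. It is not the code in cfg/cfg0.py (which presumably relies on nltk, given the requirements list); the `GRAMMAR` table and the helper names `parse`, `expand`, and `parse_sentence` are assumptions made for illustration, and the approach only works for grammars without left recursion.

```python
# Hypothetical toy grammar: each non-terminal maps to its alternative productions.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["D", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
    "D": [["the"], ["a"]],
    "N": [["she"], ["city"], ["car"]],
    "V": [["saw"], ["walked"]],
}


def parse(symbol, tokens, start):
    """Yield (tree, next_position) for every way `symbol` matches tokens[start:]."""
    if symbol not in GRAMMAR:
        # Terminal: it must equal the token at the current position.
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    for production in GRAMMAR[symbol]:
        yield from expand(symbol, production, tokens, start, [])


def expand(symbol, production, tokens, pos, children):
    """Match the remaining symbols of one production, left to right."""
    if not production:
        yield (symbol, *children), pos
        return
    first, rest = production[0], production[1:]
    for subtree, next_pos in parse(first, tokens, pos):
        yield from expand(symbol, rest, tokens, next_pos, children + [subtree])


def parse_sentence(sentence):
    """Return every parse tree that covers the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


for tree in parse_sentence("she saw the city"):
    print(tree)
```

Trees come back as nested tuples of the form `(label, child, ...)`; a sentence the grammar cannot derive yields an empty list.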

N-gram Analysis

Analyze n-gram frequencies in a corpus of documents:

python ngrams/ngrams.py <n> <corpus_directory>

# Example: Find the 10 most common bigrams in Sherlock Holmes stories
python ngrams/ngrams.py 2 ngrams/holmes

# Example: Find the 10 most common trigrams
python ngrams/ngrams.py 3 ngrams/holmes
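
The counting step behind these commands can be sketched with the standard library alone. This is an illustrative approximation, not the contents of ngrams/ngrams.py: the regex tokenizer stands in for whatever tokenizer the script uses, and the names `ngrams` and `top_ngrams` are hypothetical.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    # Zipping n progressively shifted copies of the token list yields each window.
    return zip(*(tokens[i:] for i in range(n)))


def top_ngrams(text, n, k=10):
    """Return the k most frequent n-grams in `text` with their counts."""
    # Assumption: lowercase alphabetic words (plus apostrophes) are the tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(ngrams(tokens, n)).most_common(k)


sample = "the cat sat on the mat and the cat slept"
print(top_ngrams(sample, 2, k=3))
```

For a directory of documents, the same function could be applied to the concatenated file contents.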

Markov Chain Text Generation

Generate new text based on a training corpus:

python markov/generator.py <text_file>

# Example: Generate Shakespeare-like text
python markov/generator.py markov/shakespeare.txt

The script prints five randomly generated sentences modeled on the training text.
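
As a rough illustration of the idea, the sketch below builds a word-level Markov chain by hand. The generator script presumably delegates to markovify (it is listed in the requirements), so treat this as a simplified stand-in; `build_chain` and `generate` are hypothetical names, and sentence boundaries are ignored.

```python
import random
from collections import defaultdict


def build_chain(text, order=1):
    """Map each `order`-word state to the list of words observed after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=10, seed=None):
    """Walk the chain from a random state, emitting at most `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    output = list(state)
    for _ in range(length - len(state)):
        followers = chain.get(state)
        if not followers:
            break  # dead end: this state was never followed by anything
        next_word = rng.choice(followers)
        output.append(next_word)
        state = (*state[1:], next_word)
    return " ".join(output)


chain = build_chain("to be or not to be that is the question")
print(generate(chain, length=8, seed=0))
```

Because followers are stored with repetition, frequent continuations are sampled proportionally more often. A higher `order` makes the output closer to the training text.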

Sentiment Analysis

Classify text as positive or negative:

python sentiment/sentiment.py <corpus_directory>

# Example: Using the provided corpus
python sentiment/sentiment.py sentiment/corpus
# Enter a sentence when prompted to classify it
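
The classification step can be illustrated with a small Bernoulli Naive Bayes model over word-presence features, using add-one (Laplace) smoothing. This is a sketch under assumptions rather than the logic of sentiment/sentiment.py: the training sentences are invented, and `train` and `classify` are hypothetical helper names.

```python
import math
from collections import Counter


def train(samples):
    """Count, per label, how many documents contain each word."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in samples:
        words = set(text.lower().split())
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab |= words
    return word_counts, label_counts, vocab


def classify(model, text):
    """Return the label with the highest log-posterior for `text`."""
    word_counts, label_counts, vocab = model
    words = set(text.lower().split())
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        for word in vocab:
            # Smoothed P(word present | label); words outside the vocabulary are ignored.
            p_present = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(p_present if word in words else 1 - p_present)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, for illustration only.
model = train([
    ("great fun movie", "positive"),
    ("loved it great", "positive"),
    ("terrible boring movie", "negative"),
    ("hated it boring", "negative"),
])
print(classify(model, "great fun"))
```

Working in log space avoids numeric underflow when the vocabulary is large, and the smoothing keeps an unseen word from driving a class probability to zero.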

Word Vectors

Work with word embeddings from an interactive Python session started in the repository root:

import os
os.chdir('vectors')
from vectors import words, closest_words, distance

# Find words closest to a given word
closest_words(words["king"])

# Calculate semantic distance between words
distance(words["king"], words["queen"])

# Word arithmetic (e.g., king - man + woman ≈ queen)
result = words["king"] - words["man"] + words["woman"]
closest_words(result)
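
For readers without the vector file at hand, the sketch below reproduces the idea with made-up three-dimensional vectors. The functions mirror the names used above, but the signatures are assumptions: here `closest_words` takes the vocabulary explicitly, and plain lists replace the arrays that vectors.py presumably uses.

```python
import math


def distance(u, v):
    """Cosine distance: 0 for vectors pointing the same way, up to 2 for opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)


def closest_words(embedding, vocabulary, k=3):
    """Return the k words whose vectors lie nearest to `embedding`."""
    ranked = sorted(vocabulary, key=lambda w: distance(vocabulary[w], embedding))
    return ranked[:k]


# Invented toy vectors; real embeddings have far more dimensions.
toy = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}

# Word arithmetic, element by element: king - man + woman
result = [k - m + w for k, m, w in zip(toy["king"], toy["man"], toy["woman"])]
print(closest_words(result, toy, k=1))
```

Cosine distance compares direction rather than magnitude, which is why it is a common choice for embeddings whose lengths vary.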

Project Structure

lenguage/
├── cfg/                    # Context-Free Grammar examples
│   ├── cfg0.py            # Basic grammar parser
│   └── cfg1.py            # Extended grammar with adjectives/prepositions
├── markov/                 # Markov chain text generation
│   ├── generator.py       # Text generator script
│   └── shakespeare.txt    # Sample training corpus (Shakespeare works)
├── ngrams/                 # N-gram analysis
│   ├── ngrams.py          # N-gram frequency analyzer
│   └── holmes/            # Sample corpus (Sherlock Holmes stories)
├── sentiment/              # Sentiment analysis
│   ├── sentiment.py       # Naive Bayes sentiment classifier
│   └── corpus/            # Training data (positives.txt, negatives.txt)
├── vectors/                # Word vector operations
│   ├── vectors.py         # Word embedding utilities
│   └── words.txt          # Pre-trained word vectors
├── requirements.txt        # Python dependencies
└── README.md              # This file

Requirements

Package      Version
nltk         3.9.2
markovify    0.9.4
scipy        1.16.3
numpy        2.4.0

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/new-feature)
  3. Commit your changes (git commit -am 'Add new feature')
  4. Push to the branch (git push origin feature/new-feature)
  5. Open a Pull Request

License

This project is provided for educational purposes. The Shakespeare text is from Project Gutenberg. The Sherlock Holmes stories are in the public domain.
