diff --git a/experiment/aim.md b/experiment/aim.md
index 063e045..5d4e4ae 100644
--- a/experiment/aim.md
+++ b/experiment/aim.md
@@ -1,12 +1,3 @@
-To understand and demonstrate the application of the Viterbi algorithm for Part-of-Speech (POS) tagging in Natural Language Processing. This experiment provides hands-on experience with the Viterbi decoding process, which is a fundamental dynamic programming algorithm used to find the most likely sequence of hidden states (POS tags) given observable sequences (words) in Hidden Markov Models.
+**To understand and practice sequence decoding for Part-of-Speech (POS) tagging using the Viterbi algorithm in Natural Language Processing.**
 
-The Viterbi algorithm is crucial in statistical NLP for solving the decoding problem: given a sequence of words and pre-computed emission and transition probabilities from a training corpus, determine the most probable sequence of POS tags that generated those words. This experiment allows learners to practice filling Viterbi tables step-by-step and understand how dynamic programming efficiently finds optimal tag sequences.
-
-For example, given the sentence "Book a park", the algorithm determines whether "Book" should be tagged as a noun or verb, considering both:
-
-- **Emission probabilities**: How likely each word is to be generated by each POS tag
-- **Transition probabilities**: How likely each POS tag is to follow another in sequence
-
-Through interactive simulation, learners will master the mathematical foundations of the Viterbi algorithm and its practical application in modern POS tagging systems.
-
-Viterbi Decoding Animation
+This experiment aims to help students develop proficiency in applying the Viterbi algorithm to find the most probable sequence of POS tags for a given sentence, using emission and transition probabilities. Through interactive exercises, learners will gain hands-on experience with dynamic programming and sequence labeling in NLP.
diff --git a/experiment/assignment.md b/experiment/assignment.md index e12b0d4..281461b 100644 --- a/experiment/assignment.md +++ b/experiment/assignment.md @@ -14,21 +14,21 @@ **Emission Matrix P(word|tag):** -``` +
          The    dog    runs
 Noun     0.1    0.6    0.1
 Verb     0.0    0.1    0.8
 Det      0.9    0.0    0.0
-```
+
**Transition Matrix P(tag_j|tag_i):** -``` +
          Noun   Verb   Det
 Noun     0.3    0.4    0.1
 Verb     0.4    0.1    0.2
 Det      0.7    0.2    0.1
-```
+
Assume equal initial probabilities π[tag] = 1/3 for all tags. diff --git a/experiment/extended-study.md b/experiment/extended-study.md index fbd93a7..42b27ae 100644 --- a/experiment/extended-study.md +++ b/experiment/extended-study.md @@ -1,490 +1,97 @@ -### Advanced Topics in Viterbi Decoding and Dynamic Programming +### Advanced Topics in Viterbi Decoding -### 1. Mathematical Foundations of Viterbi Algorithm +#### 1. Sequence Decoding Techniques -**Dynamic Programming Principles:** +- **Viterbi Algorithm**: Study the dynamic programming approach for finding the most probable sequence of hidden states (POS tags) in Hidden Markov Models. +- **Forward-Backward Algorithm**: Learn about parameter estimation and marginal probabilities in HMMs. +- **Beam Search and Approximations**: Explore faster, memory-efficient alternatives to full Viterbi decoding. -The Viterbi algorithm exemplifies dynamic programming with two key properties: +#### 2. Applications Across Domains -- **Optimal Substructure:** The optimal solution contains optimal solutions to subproblems -- **Overlapping Subproblems:** The same subproblems are solved multiple times +- Speech recognition and error correction +- Bioinformatics (gene/protein sequence analysis) +- Financial modeling and time series analysis +- Named Entity Recognition and Information Extraction -**Mathematical Formulation:** +#### 3. Computational Implementation -For a sequence of words w₁, w₂, ..., wₙ and tags t₁, t₂, ..., tₘ: +- Efficient storage and computation for large tagsets +- Log-space computation for numerical stability +- Handling data sparsity and smoothing techniques -
-V[i,j] = max(V[k,j-1] × P(tᵢ|tₖ)) × P(wⱼ|tᵢ)
-         k
-
+#### 4. Research Papers -Where: +1. "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition" (Rabiner, 1989) +2. "The Viterbi Algorithm" by G.D. Forney Jr. (1973) +3. "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" (Ma & Hovy) -- V[i,j] = maximum probability of any tag sequence ending in tag i at position j -- P(tᵢ|tₖ) = transition probability from tag k to tag i -- P(wⱼ|tᵢ) = emission probability of word j given tag i +#### 5. Online Resources -**Complexity Analysis:** +1. **Video Lectures** -- **Time Complexity:** O(N × T²) where N = sentence length, T = number of tags -- **Space Complexity:** O(N × T) for the Viterbi table -- **Without Dynamic Programming:** O(T^N) - exponentially worse! + - Stanford CS224N: Sequence Models and HMMs + - NPTEL: Hidden Markov Models in NLP + - Coursera: Sequence Models in NLP -### 2. Advanced Viterbi Implementations +2. **Interactive Tools** -**Numerical Stability:** + - Online HMM POS Taggers + - Viterbi algorithm visualizers + - Sequence labeling simulators -Real implementations must handle extremely small probabilities: +3. **Code Repositories** + - Open-source HMM and Viterbi implementations (Python, Java) + - Sequence labeling datasets + - Tutorials for building POS taggers -**Log-Space Computation:** +#### 6. Practical Exercises -
-log V[i,j] = max(log V[k,j-1] + log P(tᵢ|tₖ)) + log P(wⱼ|tᵢ)
-              k
-
+1. **Basic Exercises** -**Advantages of Log-Space:** + - Implement a simple Viterbi POS tagger + - Calculate emission and transition probabilities + - Visualize state transitions in Markov chains -- Avoids numerical underflow -- Converts multiplications to additions -- More computationally stable +2. **Advanced Projects** -**Memory Optimization:** + - Build a domain-adapted POS tagger + - Compare Viterbi with neural sequence models + - Analyze tagging errors and confusion matrices -- **Online Algorithm:** Only store previous column, not entire table -- **Beam Search:** Keep only top-K paths instead of all paths -- **Sparse Representations:** Skip impossible transitions +3. **Research Projects** + - Study the impact of smoothing on tagging accuracy + - Explore multilingual POS tagging with HMMs + - Integrate morphological features into sequence models -**Parallel Computation:** +#### 7. Further Reading -- Each cell in a column can be computed independently -- GPU implementations can process thousands of words simultaneously -- SIMD instructions optimize matrix operations +##### Books -### 3. Variants of the Viterbi Algorithm +1. "Speech and Language Processing" by Jurafsky & Martin (Chapters on HMMs and Viterbi) +2. "Pattern Recognition and Machine Learning" by Bishop (Sequence models section) +3. "Foundations of Statistical Natural Language Processing" by Manning & Schütze -**Forward-Backward Algorithm:** +##### Journals -Unlike Viterbi (which finds the single best path), Forward-Backward computes: +1. Computational Linguistics +2. Natural Language Engineering +3. Journal of Machine Learning Research -- **Forward:** Probability of observation sequence up to time t -- **Backward:** Probability of observation sequence from time t+1 onwards -- **Purpose:** Parameter estimation and computing marginal probabilities +#### 8. Tools and Software -**Viterbi vs. Forward-Backward:** +1. **Analysis Tools** -- Viterbi: "What's the best tag sequence?" 
-- Forward-Backward: "What's the probability of each tag at each position?" + - NLTK HMM Tagger + - Stanford POS Tagger + - spaCy sequence labeling modules -**Beam Search Approximation:** +2. **Development Frameworks** -- Keep only the top-B best paths at each step -- Trades accuracy for speed and memory -- Essential for very large tag sets or long sequences + - hmmlearn (the HMM module formerly shipped with scikit-learn) + - CRF++ toolkit + - TensorFlow/Keras for neural sequence models -**Constrained Viterbi:** - -- Add external constraints (e.g., named entity boundaries) -- Force certain tags at specific positions -- Useful for semi-supervised learning - -### 4. Viterbi in Other Domains - -**Speech Recognition:** - -- **Observation:** Acoustic features (MFCCs, spectrograms) -- **Hidden States:** Phonemes or words -- **Challenge:** Continuous observations require Gaussian mixture models - -**Bioinformatics Applications:** - -- **Gene Prediction:** Find protein-coding regions in DNA -- **Sequence Alignment:** Align biological sequences optimally -- **Hidden States:** Exon, intron, non-coding regions - -**Part-of-Speech vs. Gene Prediction:** - -
-POS:  [Noun] [Verb] [Det] [Noun]
-      "Cat"  "ate"  "the" "fish"
-
-Gene: [Exon] [Intron] [Exon] [Stop]
-      ATGC   GTAAGT    CGTT   TAG
-
- -**Financial Modeling:** - -- **Hidden States:** Market regimes (bull, bear, volatile) -- **Observations:** Price movements, trading volumes -- **Applications:** Algorithmic trading, risk management - -### 5. Modern Alternatives to Viterbi - -**Neural Sequence Models:** - -**CRF (Conditional Random Fields):** - -- Discriminative models vs. HMM's generative approach -- Can incorporate overlapping features -- Still use Viterbi for inference! - -**LSTM-CRF Models:** - -- LSTM encodes sequence context -- CRF layer ensures valid tag transitions -- Viterbi decoding finds optimal path - -**Transformer Models:** - -- Self-attention mechanisms -- Can process entire sequence simultaneously -- Often use greedy decoding instead of Viterbi - -**When Viterbi Still Matters:** - -- Neural models often use Viterbi in final layer -- Structured prediction requires path optimization -- Interpretability and guaranteed optimality - -**When to Use HMMs:** - -- Limited computational resources -- Need for model interpretability -- Educational purposes -- Quick prototyping - -### 6. 
Debugging and Optimizing Viterbi - -**Common Implementation Errors:** - -**Probability Underflow:** - -- Problem: Probabilities become too small (approach 0) -- Solution: Use log-space computation -- Detection: Results become NaN or infinite - -**Incorrect Backtracking:** - -- Problem: Path reconstruction gives wrong sequence -- Solution: Verify pointer array construction -- Testing: Compare with ground truth on small examples - -**Matrix Indexing Errors:** - -- Problem: Off-by-one errors in array access -- Solution: Consistent 0-based or 1-based indexing -- Prevention: Unit tests for each function - -**Performance Optimization:** - -**Memory Access Patterns:** - -- Store matrices in row-major or column-major order -- Optimize cache usage for large vocabularies -- Use sparse matrices for limited tag sets - -**Vectorization:** - -- Use SIMD instructions for parallel computation -- NumPy/BLAS operations for matrix multiplication -- GPU kernels for massive parallelization - -**Profiling Tips:** - -- Measure actual bottlenecks, not assumed ones -- Profile on realistic data sizes -- Consider both time and memory usage - -### 7. 
Advanced Viterbi Extensions - -**Higher-Order Models:** - -**Second-Order Viterbi:** - -- Consider two previous tags: P(tag₃|tag₁, tag₂) -- Complexity increases to O(N × T³) -- Better linguistic modeling at computational cost - -**Maximum Entropy Markov Models:** - -- Combine Viterbi with feature-based models -- Can incorporate arbitrary features -- More flexible than pure HMMs - -**Semi-CRF Models:** - -- Segments of variable length -- Each segment has a single label -- Applications: Named entity recognition, chunking - -**Approximate Viterbi Methods:** - -**Pruning Strategies:** - -- Beam search: Keep top-K candidates -- Threshold pruning: Discard low-probability paths -- Forward-backward pruning: Use forward probabilities to guide search - -**Hierarchical Decoding:** - -- First pass: Coarse tag categories -- Second pass: Fine-grained tags within categories -- Reduces computational complexity -- Consistent POS tag definitions -- Enables cross-lingual model development - -**Language-Specific Considerations:** - -- **Agglutinative Languages:** Complex morphology requires sub-word analysis -- **Isolating Languages:** Fewer morphological variations -- **Fusional Languages:** Multiple grammatical features per word - -### 8. Practical Viterbi Implementation - -**Data Structures:** - -**Viterbi Table Storage:** - -
-# 2D array: viterbi[tag][position]
-viterbi = [[0.0] * sentence_length for _ in range(num_tags)]
-
-# Backpointer array for path reconstruction
-backpointer = [[0] * sentence_length for _ in range(num_tags)]
-
- -**Memory-Efficient Implementation:** - -
-# Only store current and previous columns
-current_column = [0.0] * num_tags
-previous_column = [0.0] * num_tags
-
- -**Handling Edge Cases:** - -**Zero Probabilities:** - -- Replace with small epsilon value (e.g., 1e-10) -- Use smoothing for unseen word-tag combinations -- Graceful degradation for OOV words - -**Sentence Boundaries:** - -- Special START and END tokens -- Initialize first column with start probabilities -- Terminate at END token - -**Efficiency Considerations:** - -**Sparse Matrices:** - -- Many transition probabilities are zero -- Use compressed sparse row (CSR) format -- Skip impossible transitions during computation - -**Parallel Processing:** - -- Each tag in a column can be computed independently -- Multi-threading for large vocabularies -- GPU implementations for massive datasets - -### 9. Research and Applications - -**Current Research Areas:** - -**Neural-Symbolic Integration:** - -- Combining neural networks with Viterbi inference -- Differentiable dynamic programming -- End-to-end learning with structured output - -**Structured Attention:** - -- Attention mechanisms that mimic Viterbi paths -- Soft vs. hard alignment in sequence models -- Interpretable neural sequence models - -**Online Learning:** - -- Updating Viterbi models with streaming data -- Incremental parameter estimation -- Concept drift adaptation - -**Emerging Applications:** - -**Computational Biology:** - -- Protein structure prediction -- Gene regulatory network inference -- Phylogenetic analysis using HMMs - -**Signal Processing:** - -- Speech enhancement and denoising -- Gesture recognition from sensor data -- Financial time series analysis - -**Computer Vision:** - -- Object tracking in video sequences -- Action recognition in temporal data -- Medical image sequence analysis - -### 10. Hands-on Viterbi Projects - -**Beginner Projects:** - -1. **Pure Viterbi Implementation** - - - Code the algorithm from scratch in Python - - Implement both probability and log-space versions - - Test on the experiment's corpus data - -2. 
**Viterbi Visualization** - - - Create animated visualizations of table filling - - Show path probability evolution - - Highlight optimal path discovery - -3. **Performance Analysis** - - Compare execution times for different sentence lengths - - Measure memory usage growth - - Analyze complexity empirically - -**Intermediate Projects:** - -1. **Multi-Domain Viterbi** - - - Build taggers for different text domains - - Compare transition matrix patterns - - Implement domain adaptation techniques - -2. **Approximate Viterbi** - - - Implement beam search variants - - Compare accuracy vs. speed trade-offs - - Analyze when approximations fail - -3. **Parallel Viterbi** - - Multi-threaded implementation - - GPU acceleration using CUDA/OpenCL - - Benchmark parallel efficiency - -**Advanced Projects:** - -1. **Neural-Viterbi Hybrid** - - - Use neural networks for emission probabilities - - Keep Viterbi for structured inference - - Compare with end-to-end neural models - -2. **Structured Perceptron with Viterbi** - - - Implement discriminative training - - Use Viterbi for loss-augmented inference - - Compare with CRF models - -3. **Real-Time Viterbi System** - - Build streaming POS tagger - - Handle partial observations - - Optimize for low latency - -### 11. Resources for Further Learning - -**Core Algorithms and Theory:** - -**Essential Papers:** - -- "The Viterbi Algorithm" by G.D. Forney Jr. (1973) - Original IEEE paper -- "A Tutorial on Hidden Markov Models and Selected Applications" by Rabiner (1989) -- "Dynamic Programming and the Viterbi Algorithm" by Viterbi (1967) - -**Textbooks:** - -- "Introduction to Algorithms" by Cormen et al. - Dynamic Programming chapter -- "Speech and Language Processing" by Jurafsky & Martin - HMM and Viterbi sections -- "Pattern Recognition and Machine Learning" by Bishop - Sequence models - -**Advanced Topics:** - -**Structured Prediction:** - -- "Structured Prediction Models via the Matrix-Tree Theorem" by Koo et al. 
-- "Discriminative Training Methods for Hidden Markov Models" by Povey & Woodland - -**Modern Applications:** - -- "Neural Architectures for Named Entity Recognition" by Lample et al. -- "End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF" by Ma & Hovy - -**Implementation Resources:** - -**Programming Libraries:** - -- **Python:** NLTK, scikit-learn, TensorFlow Probability -- **Java:** OpenNLP, Stanford CoreNLP -- **C++:** HTK, Julius (speech recognition) -- **R:** HMM package, RHmm - -**Datasets for Practice:** - -- Penn Treebank (English POS tagging) -- Universal Dependencies (multilingual) -- CoNLL shared tasks (various sequence labeling tasks) - -**Online Tutorials:** - -- Interactive Viterbi visualization: https://web.stanford.edu/~jurafsky/slp3/ -- Dynamic programming tutorials with Viterbi examples -- YouTube lectures on HMMs and dynamic programming - -### 12. Career Applications - -**Industry Roles Utilizing Viterbi:** - -**Algorithm Engineer:** - -- Implementing efficient Viterbi variants for production systems -- Optimizing dynamic programming algorithms for specific hardware -- Developing domain-specific sequence models - -**Machine Learning Engineer:** - -- Integrating Viterbi into neural architectures -- Building hybrid statistical-neural models -- Optimizing inference pipelines for real-time applications - -**Research Scientist:** - -- Developing new structured prediction algorithms -- Exploring applications beyond NLP (biology, finance, robotics) -- Publishing on algorithmic innovations and theoretical advances - -**Application Domains:** - -**Healthcare:** - -- Electronic health record processing -- Medical image sequence analysis -- Drug discovery sequence modeling - -**Autonomous Systems:** - -- Robot navigation and path planning -- Sensor fusion for state estimation -- Behavior prediction in dynamic environments - -**Financial Technology:** - -- Algorithmic trading with regime detection -- Risk modeling with hidden state models -- 
Market sentiment analysis from text streams - -**Telecommunications:** - -- Error correction in digital communications -- Network state monitoring and optimization -- Speech compression and enhancement - -This extended study demonstrates how mastering the Viterbi algorithm opens doors to diverse applications across computer science and provides a solid foundation for understanding modern structured prediction methods in machine learning. +3. **Evaluation Tools** + - POS tagging accuracy metrics + - Confusion matrix generators + - Error analysis scripts diff --git a/experiment/glossary.md b/experiment/glossary.md index 29f70a7..1f592d5 100644 --- a/experiment/glossary.md +++ b/experiment/glossary.md @@ -1,131 +1,75 @@ -### A +### Core Viterbi Decoding and Sequence Labeling Terms -**Algorithm** - A step-by-step procedure for solving a problem or completing a task, such as the Viterbi algorithm for finding the most likely POS tag sequence. +**Viterbi Algorithm**: +A dynamic programming algorithm for finding the most probable sequence of hidden states (such as POS tags) in a Hidden Markov Model. -**Ambiguity** - The property of words that can belong to multiple grammatical categories depending on context (e.g., "run" can be a noun or verb). +**Hidden Markov Model (HMM)**: +A statistical model where the system being modeled is assumed to be a Markov process with unobserved (hidden) states. -### B +**Part-of-Speech (POS) Tagging**: +The process of assigning grammatical categories (noun, verb, adjective, etc.) to each word in a sentence. -**Backtracking** - The final step in the Viterbi algorithm where the most likely path is traced backwards to determine the complete POS tag sequence. +**Transition Probability**: +The probability of moving from one state (POS tag) to another in a sequence, e.g., P(VERB | NOUN). -**Beam Search** - An approximation to Viterbi that keeps only the top-K most probable paths at each step, trading accuracy for computational efficiency. 
+**Emission Probability**: +The probability of observing a word given a particular POS tag, e.g., P(dog | NOUN). -**Bigram** - A sequence of two adjacent elements, in HMM context referring to consecutive POS tags used in transition probabilities. +**Sequence Decoding**: +The process of finding the most likely sequence of hidden states given a sequence of observations. -### C +**Dynamic Programming**: +An algorithmic technique that solves complex problems by breaking them down into simpler subproblems, used in the Viterbi algorithm. -**Corpus** - A large collection of written or spoken texts used for linguistic analysis and training statistical models. +**Ambiguity**: +The property of words that can belong to multiple grammatical categories depending on context (e.g., "can" as a verb or noun). -**Conditional Probability** - The probability of an event occurring given that another event has occurred, fundamental to HMM calculations. +**Training Data**: +Annotated sentences used to learn the parameters of a statistical model. -### D +**Corpus**: +A large collection of written or spoken texts used for linguistic analysis and training statistical models. -**Decoding** - The process of finding the most likely sequence of hidden states (POS tags) given the observed sequence (words). +**N-gram**: +A contiguous sequence of n items (words or tags) from a given sequence of text or speech. -**Dynamic Programming** - An algorithmic technique that solves complex problems by breaking them down into simpler subproblems, used in the Viterbi algorithm. +**Unigram, Bigram, Trigram**: +A single word/tag, a sequence of two, or a sequence of three, respectively. -### E +**Smoothing**: +Techniques used to handle zero probabilities in statistical models by redistributing probability mass. -**Emission Probability** - The probability of observing a particular word given a specific POS tag, denoted as P(word|tag). 
+**Sequence Labeling**: +The task of assigning labels (such as POS tags) to elements in a sequence. -**End of Sentence (EOS)** - A special marker used to denote sentence boundaries in corpus annotation and HMM training. +**Observation**: +The visible outputs (words) generated by the hidden states (POS tags) in an HMM. -### F +**State**: +A condition or situation in a system, in HMM referring to the hidden grammatical categories. -**First-order Markov Model** - A model where the probability of the next state depends only on the current state, not on the entire history. +**Lexical Category**: +The grammatical class of a word (noun, verb, adjective, etc.). -**Forward Algorithm** - An algorithm for computing the probability of an observation sequence in an HMM. +**Word Tokenization**: +The process of breaking text into individual words or tokens. -### G +**Out-of-Vocabulary (OOV)**: +Words that appear in test data but were not seen during training. -**Grammar** - The set of structural rules governing the composition of clauses, phrases, and words in a language. +**Probability Matrix**: +A matrix containing probability values, such as transition or emission probabilities in an HMM. -**Grammatical Category** - A class of words that have similar grammatical properties (noun, verb, adjective, etc.). +**Statistical Model**: +A mathematical model that uses probability distributions to represent data and make predictions. -### H +**Decoding**: +The process of finding the most likely sequence of hidden states (POS tags) given the observed sequence (words). -**Hidden Markov Model (HMM)** - A statistical model where the system being modeled is assumed to be a Markov process with unobserved states. +**Maximum Likelihood Estimation**: +A method of estimating model parameters by finding values that maximize the likelihood of the observed data. -**Hidden States** - The unobserved states in an HMM, which in POS tagging correspond to the grammatical categories. 
+**Natural Language Processing (NLP)**: +A field of computer science and artificial intelligence concerned with interactions between computers and human language. -### I - -**Independence Assumption** - The assumption that the probability of observing a word depends only on its POS tag, not on other words or tags. - -**Initial State Distribution** - The probability distribution over the possible starting states in an HMM. - -### L - -**Lexical Category** - Another term for part-of-speech or grammatical category of a word. - -**Likelihood** - The probability of observing the given data under a particular model or set of parameters. - -**Log-Space Computation** - A numerical technique used in Viterbi algorithm to prevent underflow by working with logarithms of probabilities instead of probabilities themselves. - -### M - -**Markov Assumption** - The assumption that the probability of the next state depends only on the current state. - -**Markov Chain** - A mathematical system that undergoes transitions from one state to another according to certain probabilistic rules. - -**Maximum Likelihood Estimation** - A method of estimating model parameters by finding values that maximize the likelihood of the observed data. - -### N - -**Natural Language Processing (NLP)** - A field of computer science and artificial intelligence concerned with interactions between computers and human language. - -**N-gram** - A contiguous sequence of n items from a given sequence of text or speech. - -### O - -**Observation** - In HMM context, the visible outputs (words) that are generated by the hidden states (POS tags). - -**Optimal Substructure** - A key property of dynamic programming problems where optimal solutions contain optimal solutions to subproblems, enabling the Viterbi algorithm's efficiency. - -**Out-of-Vocabulary (OOV)** - Words that appear in test data but were not seen during training. 
- -### P - -**Part-of-Speech (POS)** - A category of words that have similar grammatical properties (noun, verb, adjective, etc.). - -**POS Tagging** - The process of assigning part-of-speech tags to words in a sentence. - -**Probability Matrix** - A matrix containing probability values, such as transition or emission probabilities in an HMM. - -**Pruning** - Optimization techniques in Viterbi algorithm that discard low-probability paths to reduce computational complexity. - -### S - -**Sequence Labeling** - The task of assigning labels to elements in a sequence, such as POS tags to words. - -**Smoothing** - Techniques used to handle zero probabilities in statistical models by redistributing probability mass. - -**State** - A condition or situation in a system, in HMM referring to the hidden grammatical categories. - -**Statistical Model** - A mathematical model that uses probability distributions to represent data and make predictions. - -### T - -**Transition Probability** - The probability of moving from one state to another in a sequence, denoted as P(tag₂|tag₁). - -**Training Data** - Annotated data used to learn the parameters of a statistical model. - -### U - -**Unigram** - A single word or token, used in calculating base probabilities for words. - -**Unsupervised Learning** - Machine learning where the algorithm learns patterns from data without labeled examples. - -### V - -**Viterbi Algorithm** - A dynamic programming algorithm for finding the most likely sequence of hidden states in an HMM. - -**Viterbi Table** - The matrix used to store intermediate probability calculations during Viterbi decoding, where each cell represents the maximum probability of any path ending at a specific state and time. - -**Vocabulary** - The set of all unique words in a corpus or dataset. - -### W - -**Word Sense Disambiguation** - The process of determining which meaning of a word is used in a particular context. 
- -**Word Tokenization** - The process of breaking text into individual words or tokens. +--- diff --git a/experiment/posttest.json b/experiment/posttest.json index 9019954..2eb10e7 100644 --- a/experiment/posttest.json +++ b/experiment/posttest.json @@ -19,7 +19,7 @@ "difficulty": "beginner" }, { - "question": "In the simulation, when filling the Viterbi table for the second word onwards, what is the correct formula for computing V[i][j]?", + "question": "In the simulation, when filling the Viterbi table for the second word onwards,
what is the correct formula for computing V[i][j]?
Choose the best option:", "answers": { "a": "V[i][j] = max_k(V[k][j-1] × transition[k][i]) × emission[i][word_j]", "b": "V[i][j] = V[i][j-1] × emission[i][word_j]", diff --git a/experiment/pretest.json b/experiment/pretest.json index 05d4d26..be77841 100644 --- a/experiment/pretest.json +++ b/experiment/pretest.json @@ -38,8 +38,8 @@ { "question": "What information does the emission matrix provide in HMM-based POS tagging?", "answers": { - "a": "P(word | POS tag) - the probability of observing a word given a POS tag", - "b": "P(POS tag | word) - the probability of a POS tag given a word", + "a": "P( word | POS tag ) - the probability of observing a word given a POS tag", + "b": "P( POS tag | word ) - the probability of a POS tag given a word", "c": "The frequency of words in the training corpus", "d": "The order of words in a sentence" }, @@ -53,10 +53,10 @@ "difficulty": "beginner" }, { - "question": "What does the transition matrix represent in the context of POS tagging?", + "question": "What does the transition matrix represent in the context of POS tagging?
Choose the best option:", "answers": { - "a": "P(tag_i | tag_i-1) - the probability of transitioning from one POS tag to another", - "b": "P(word_i | word_i-1) - the probability of one word following another", + "a": "P(tag_i | tag_{i-1}) - the probability of transitioning from one POS tag to another", + "b": "P(word_i | word_{i-1}) - the probability of one word following another", "c": "The total number of tags in the tagset", "d": "The length of the training sentences" }, @@ -70,7 +70,7 @@ "difficulty": "intermediate" }, { - "question": "What is the key computational technique that makes the Viterbi algorithm efficient?", + "question": "What is the key computational technique that makes the Viterbi algorithm efficient?
Choose the best option:", "answers": { "a": "Dynamic programming", "b": "Brute force search", @@ -87,7 +87,7 @@ "difficulty": "intermediate" }, { - "question": "In the Viterbi algorithm, what does each cell V[i][j] in the Viterbi table represent?", + "question": "In the Viterbi algorithm, what does each cell V[i][j] in the Viterbi table represent?
Choose the best option:", "answers": { "a": "The maximum probability of any tag sequence ending in tag i at word position j", "b": "The emission probability of word j given tag i", diff --git a/experiment/procedure.md b/experiment/procedure.md index 040165e..ba048db 100644 --- a/experiment/procedure.md +++ b/experiment/procedure.md @@ -66,15 +66,3 @@ Follow these steps to complete the Viterbi Decoding experiment: - Use **"Reset"** to clear your work and start over - Try different corpora to practice with various vocabulary and sentence structures - Focus on understanding the relationship between training data and decoding outcomes - -### Learning Tips - -- **Mathematical Understanding**: Focus on how each cell value is computed using dynamic programming -- **Linguistic Intuition**: Consider why certain tag sequences are more probable than others -- **Error Analysis**: When answers are incorrect, analyze whether the error was in: - - Emission probability lookup - - Transition probability lookup - - Mathematical computation - - Understanding of the algorithm - -This hands-on approach reinforces theoretical understanding of the Viterbi algorithm while providing practical experience with statistical POS tagging. diff --git a/experiment/theory.md b/experiment/theory.md index 1f802a0..ddbff42 100644 --- a/experiment/theory.md +++ b/experiment/theory.md @@ -1,181 +1,200 @@ -Part-of-Speech (POS) tagging is a fundamental sequence labeling task in Natural Language Processing that assigns grammatical categories to words in context. The **Viterbi algorithm** provides an elegant dynamic programming solution to find the most probable sequence of POS tags using Hidden Markov Models. +Sequence decoding is a fundamental process in computational linguistics and Natural Language Processing (NLP) that involves finding the most probable sequence of hidden states (such as POS tags) given a sequence of observed words. 
The Viterbi algorithm is the standard dynamic programming solution for this task, especially when using Hidden Markov Models (HMMs). --- -#### 1. Hidden Markov Models for POS Tagging +### What is Viterbi Decoding? -A Hidden Markov Model for POS tagging consists of: +Viterbi decoding is the computational process of determining the most likely sequence of hidden states (tags) for a given sequence of observations (words), using: -- **Hidden States (S)**: POS tags {Noun, Verb, Adjective, Determiner, ...} -- **Observable Symbols (O)**: Words in the vocabulary {the, cat, runs, quickly, ...} -- **Transition Probabilities (A)**: P(tag_j | tag_i) - likelihood of tag sequence -- **Emission Probabilities (B)**: P(word | tag) - likelihood of word given tag -- **Initial Probabilities (π)**: P(tag) - probability of starting with a tag +- **Emission probabilities**: Likelihood of a word given a tag +- **Transition probabilities**: Likelihood of a tag following another tag +- **Initial probabilities**: Likelihood of starting with a particular tag + +The algorithm efficiently computes the optimal path through a trellis (table) using dynamic programming. --- +### Core Components + +#### 1. Hidden Markov Model (HMM) + +An HMM for POS tagging consists of: + +- **Hidden States (Tags)**: e.g., Noun, Verb, Adjective, Determiner +- **Observations (Words)**: e.g., "the", "cat", "runs" +- **Transition Probabilities**: P(tag₂ | tag₁) +- **Emission Probabilities**: P(word | tag) +- **Initial Probabilities**: P(tag₁) + #### 2. The Decoding Problem -Given a sequence of words **W = w₁, w₂, ..., wₙ** and HMM parameters **(A, B, π)**, find the most likely tag sequence **T\* = t₁, t₂, ..., tₙ** such that: +Given a sequence of words **W = w₁, w₂, ..., wₙ** and HMM parameters, find the most likely tag sequence **T\*** = t₁, t₂, ..., tₙ: -**T\* = argmax P(T | W)** +
-Using Bayes' theorem and the Markov assumption: +$$ +T^* = \arg\max_{T} P(T \mid W) +$$ -**T\* = argmax ∏ᵢ₌₁ⁿ P(wᵢ | tᵢ) × P(tᵢ | tᵢ₋₁)** +
---- +Using the Markov and output independence assumptions, this becomes: -#### 3. Viterbi Algorithm: Dynamic Programming Solution +
-The Viterbi algorithm solves this optimization problem efficiently using dynamic programming principles. +$$ +T^* = \arg\max_{T} \prod_{i=1}^n P(w_i \mid t_i) \times P(t_i \mid t_{i-1}) +$$ -#### **Mathematical Foundation** +
+ +--- + +### Viterbi Algorithm: Dynamic Programming Solution + +#### Mathematical Foundation For each word position **j** and tag **s**, we compute: -**V[s][j] = max*{s'} (V[s'][j-1] × a*{s',s}) × b_s(wⱼ)** +
+ +$$ +V[s][j] = \max_{s'} \left( V[s'][j-1] \times a_{s',s} \right) \times b_s(w_j) +$$ + +
Where: - **V[s][j]**: Maximum probability of any tag sequence ending in state **s** at position **j** - **a\_{s',s}**: Transition probability from tag **s'** to tag **s** -- **b_s(wⱼ)**: Emission probability of word **wⱼ** given tag **s** +- **b_s(w_j)**: Emission probability of word **w_j** given tag **s** -#### **Algorithm Steps** +--- -##### 1. **Initialization** (j = 1) +#### Algorithm Steps -``` -V[s][1] = π[s] × b_s(w₁) -Path[s][1] = 0 -``` +#### 1. Initialization (j = 1) -##### 2. **Recursion** (j = 2 to N) +
-``` -For each state s: - V[s][j] = max_{s'} (V[s'][j-1] × a_{s',s}) × b_s(wⱼ) - Path[s][j] = argmax_{s'} (V[s'][j-1] × a_{s',s}) -``` +$$ +V[s][1] = \pi[s] \times b_s(w_1) +$$ -##### 3. **Termination** +
-``` -P* = max_s V[s][N] -q*_N = argmax_s V[s][N] -``` +#### 2. Recursion (j = 2 \ldots n) -##### 4. **Backtracking** (j = N-1 to 1) +For each state **s**: -``` -q*_j = Path[q*_{j+1}][j+1] -``` +
---- +$$ +V[s][j] = \max_{s'} \left( V[s'][j-1] \times a_{s',s} \right) \times b_s(w_j) +$$ -#### 4. Example Walkthrough +
-Consider tagging **"Book that flight"** with tags {Noun, Verb, Det}: +#### 3. Termination -#### **Probability Matrices** +
-**Emission Matrix P(word|tag):** +$$ +P^* = \max_s V[s][n] +$$ -``` - Book that flight -Noun 0.3 0.1 0.8 -Verb 0.7 0.0 0.1 -Det 0.0 0.9 0.0 -``` +$$ +q^*_n = \arg\max_s V[s][n] +$$ -**Transition Matrix P(tag_j|tag_i):** +
-``` - Noun Verb Det -Noun 0.2 0.1 0.6 -Verb 0.5 0.2 0.3 -Det 0.8 0.2 0.0 -``` +#### 4. Backtracking -#### **Viterbi Table Computation** +For **j = n-1** to **1**: -**Time t=1 (Book):** +
-- V[Noun][1] = 0.33 × 0.3 = 0.10 -- V[Verb][1] = 0.33 × 0.7 = 0.23 -- V[Det][1] = 0.33 × 0.0 = 0.00 +$$ +q^*_j = \text{Path}[q^*_{j+1}][j+1] +$$ -**Time t=2 (that):** +
-- V[Noun][2] = max(0.10×0.2, 0.23×0.5, 0.00×0.8) × 0.1 = 0.0115 -- V[Verb][2] = max(0.10×0.1, 0.23×0.2, 0.00×0.2) × 0.0 = 0.0 -- V[Det][2] = max(0.10×0.6, 0.23×0.3, 0.00×0.0) × 0.9 = 0.0621 +--- -**Time t=3 (flight):** +### Example Walkthrough -- V[Noun][3] = max(0.0115×0.2, 0.0×0.5, 0.0621×0.8) × 0.8 = 0.0398 -- V[Verb][3] = max(0.0115×0.1, 0.0×0.2, 0.0621×0.2) × 0.1 = 0.0001 -- V[Det][3] = max(0.0115×0.6, 0.0×0.3, 0.0621×0.0) × 0.0 = 0.0 +**Sentence:** "Book that flight" +**Tags:** {Noun, Verb, Det} -**Optimal Path**: Verb → Det → Noun = "Book that flight" +**Emission Matrix P(word | tag):** ---- +| | Book | that | flight | +| ---- | ---- | ---- | ------ | +| Noun | 0.3 | 0.1 | 0.8 | +| Verb | 0.7 | 0.0 | 0.1 | +| Det | 0.0 | 0.9 | 0.0 | -#### 5. Computational Complexity +**Transition Matrix P(tag₂ | tag₁):** -- **Time Complexity**: O(N × T²) where N = sentence length, T = number of tags -- **Space Complexity**: O(N × T) for the Viterbi table +| | Noun | Verb | Det | +| ---- | ---- | ---- | --- | +| Noun | 0.2 | 0.1 | 0.6 | +| Verb | 0.5 | 0.2 | 0.3 | +| Det | 0.8 | 0.2 | 0.0 | -**Comparison**: Without dynamic programming, finding optimal path requires O(T^N) time, making Viterbi essential for practical applications. +**Viterbi Table Computation:** ---- +- **Time t=1 (Book):** -#### 6. Key Insights + - V[Noun][1] = 0.33 × 0.3 = 0.10 + - V[Verb][1] = 0.33 × 0.7 = 0.23 + - V[Det][1] = 0.33 × 0.0 = 0.00 -#### **Optimal Substructure** +- **Time t=2 (that):** -The optimal solution contains optimal solutions to subproblems - crucial for dynamic programming. 
+ - V[Noun][2] = max(0.10×0.2, 0.23×0.5, 0.00×0.8) × 0.1 = 0.0115 + - V[Verb][2] = max(0.10×0.1, 0.23×0.2, 0.00×0.2) × 0.0 = 0.0 + - V[Det][2] = max(0.10×0.6, 0.23×0.3, 0.00×0.0) × 0.9 = 0.0621 -#### **Markov Property** +- **Time t=3 (flight):** + - V[Noun][3] = max(0.0115×0.2, 0.0×0.5, 0.0621×0.8) × 0.8 = 0.0398 + - V[Verb][3] = max(0.0115×0.1, 0.0×0.2, 0.0621×0.2) × 0.1 = 0.0001 + - V[Det][3] = max(0.0115×0.6, 0.0×0.3, 0.0621×0.0) × 0.0 = 0.0 -Current tag depends only on the previous tag, not the entire history: P(tᵢ | t₁...tᵢ₋₁) = P(tᵢ | tᵢ₋₁) +**Optimal Path:** Verb → Det → Noun = "Book that flight" -#### **Probability Balance** +--- -The algorithm optimally balances: +### Key Insights -- **Local compatibility**: How well words fit their tags (emission probabilities) -- **Global coherence**: How well tag sequences flow together (transition probabilities) +- **Optimal Substructure:** The best solution contains best solutions to subproblems. +- **Markov Property:** Each tag depends only on the previous tag. +- **Probability Balance:** The algorithm balances local (emission) and global (transition) probabilities. --- -#### 7. Applications Beyond POS Tagging +### Applications Beyond POS Tagging -- **Speech Recognition**: Finding most likely word sequences from acoustic signals -- **Bioinformatics**: Gene sequence analysis and protein structure prediction -- **Named Entity Recognition**: Identifying person, location, organization mentions -- **Machine Translation**: Word alignment between source and target languages -- **Information Extraction**: Structured data extraction from unstructured text +- Speech Recognition +- Bioinformatics (gene/protein sequence analysis) +- Named Entity Recognition +- Machine Translation +- Information Extraction --- -#### 8. Practical Considerations - -#### **Smoothing Techniques** +### Practical Considerations -Handle unseen word-tag combinations using: +- **Smoothing:** Handle unseen word-tag pairs (Laplace, Good-Turing, back-off). 
+- **Unknown Words:** Use morphological analysis, character features, or embeddings. -- Add-one (Laplace) smoothing -- Good-Turing estimation -- Back-off models - -#### **Unknown Words** +--- -Strategies for out-of-vocabulary words: +### Conclusion -- Morphological analysis -- Character-level features -- Word embeddings +The Viterbi algorithm is a cornerstone of sequence labeling in NLP, providing an efficient and mathematically sound method for decoding the most probable sequence of tags. This experiment lets you practice filling Viterbi tables and understanding dynamic programming in real-world POS tagging. -The Viterbi algorithm remains a cornerstone of sequence labeling, providing both theoretical elegance and practical efficiency for natural language processing tasks. +Viterbi Decoding Animation diff --git a/experiment/trivia.md b/experiment/trivia.md index 9e29e34..0f72a1a 100644 --- a/experiment/trivia.md +++ b/experiment/trivia.md @@ -1,79 +1,21 @@ -### Historical Facts +### Fun Facts About Viterbi Decoding -🔍 **Did you know?** The Viterbi algorithm was named after Andrew Viterbi, who developed it in 1967 for decoding convolutional codes in digital communications before it revolutionized sequence analysis in bioinformatics and NLP! +1. **Algorithm Origins**: The Viterbi algorithm was invented by Andrew Viterbi in 1967 for decoding signals in digital communications, but it is now a cornerstone in NLP for sequence labeling tasks like POS tagging. -🎯 **Nobel Connection:** Andrew Viterbi was awarded the 2006 Marconi Prize for his contributions to telecommunications, and his algorithm now powers everything from speech recognition to gene sequencing. +2. **Efficiency Breakthrough**: Viterbi decoding reduces the search for the best tag sequence from trillions of possibilities to a manageable computation using dynamic programming. 
-📚 **Cross-Domain Impact:** Originally designed for error correction in noisy communication channels, the Viterbi algorithm found its way into computational linguistics in the 1980s and became fundamental to statistical POS tagging. +3. **Cross-Disciplinary Impact**: Beyond language, Viterbi is used in speech recognition, gene sequencing, error correction in telecommunications, and even financial modeling. -### Technical Insights +4. **Optimal Path Guarantee**: Unlike heuristic algorithms, Viterbi always finds the most probable sequence of tags given the model’s probabilities. -⚡ **Complexity Magic:** The Viterbi algorithm reduces the complexity of finding the best POS sequence from exponential O(T^N) to polynomial O(N×T²), making real-time tagging possible! +5. **Memory Magic**: The algorithm only needs to remember the best path to each state at each step, making it both fast and memory-efficient. -🧠 **Dynamic Programming Genius:** The algorithm's brilliance lies in the optimal substructure property - the best path to any state contains the best paths to all previous states. +6. **Tiny Probabilities**: Viterbi often works with extremely small probabilities (like 10⁻¹⁵), so implementations use logarithms to avoid underflow errors. -🔢 **Memory Efficiency:** Despite evaluating millions of possible tag sequences, Viterbi only needs to remember the best path to each state at each time step, dramatically reducing memory requirements. +7. **Real-Time Applications**: Modern smartphones use Viterbi-based algorithms for autocorrect and voice-to-text, enabling fast and accurate language processing. -### Viterbi Algorithm Specifics +8. **Ambiguous Words**: Words like "book," "can," and "round" can be tagged as different parts of speech depending on context—Viterbi helps resolve these ambiguities. -� **Decoding Challenge:** For a 10-word sentence with 45 possible POS tags, there are 45^10 = 2.8 trillion possible tag sequences! Viterbi finds the best one efficiently. +9. 
**Educational Value**: Understanding Viterbi decoding is foundational for learning about more advanced neural sequence models like LSTMs and Transformers. -🔍 **Backtracking Beauty:** The algorithm fills the probability table forward but traces the optimal path backward - like solving a maze by remembering the best route to each junction. - -� **Probability Precision:** Viterbi calculations often involve very small probabilities (like 10^-15), requiring careful numerical handling to avoid underflow errors in implementations. - -### Computational Curiosities - -💻 **Matrix Operations:** Each cell in the Viterbi table requires T multiplications and comparisons, where T is the number of POS tags - the algorithm is essentially a smart matrix multiplication! - -🔍 **Path Optimization:** Unlike other algorithms that might find "good enough" solutions, Viterbi is guaranteed to find the globally optimal POS tag sequence given the HMM parameters. - -📊 **Training vs. Decoding:** Training an HMM requires counting occurrences in the corpus, but Viterbi decoding uses those probabilities to make optimal predictions on new sentences. - -### Practical Applications - -🌐 **Beyond POS Tagging:** The Viterbi algorithm is used in speech recognition, bioinformatics (gene sequencing), and even predicting stock market trends! - -📱 **Real-Time Processing:** Modern smartphones use Viterbi-based algorithms for autocorrect and voice-to-text conversion, processing speech in real-time. - -🔤 **Error Correction:** The algorithm's original purpose in telecommunications - correcting transmission errors - shares the same mathematical foundation as finding optimal POS sequences. - -### Fun Challenges - -🎯 **Tricky Words:** Words like "that," "will," and "can" are among the most challenging for POS taggers due to their multiple grammatical roles. - -🔀 **Context Matters:** The word "book" can be a noun ("read a book") or a verb ("book a flight"), showing why sequential context is important. 
- -📝 **Rare Phenomena:** Some words can function as almost any part of speech - "round" can be a noun, verb, adjective, adverb, or preposition! - -### Educational Insights - -🎓 **Learning Challenge:** Students often confuse forward probability (likelihood of observations) with Viterbi probability (likelihood of the best path) - they're related but different! - -📈 **Debugging Tip:** When Viterbi gives unexpected results, check if emission and transition probabilities sum correctly and whether the corpus represents the test domain. - -🔬 **Foundation Importance:** Understanding Viterbi is crucial for grasping modern neural sequence models like LSTMs and Transformers, which use similar dynamic programming principles. - -### Algorithm Surprises - -📊 **Optimality Guarantee:** The Viterbi algorithm is guaranteed to find the most probable tag sequence - no heuristic approximation needed! - -🎲 **Probability Precision:** The algorithm handles probabilities so small that standard floating-point arithmetic fails - logarithmic computation is essential in practice. - -🔄 **Table Filling Magic:** Each cell calculation in the Viterbi table depends only on the previous column, enabling efficient parallel computation and memory optimization. - -### Cultural and Linguistic Notes - -🌍 **Language Variation:** Different languages have varying numbers of POS categories - Chinese has fewer than English, while agglutinative languages like Turkish have many more. - -📚 **Historical Change:** The parts of speech we use today were largely codified by ancient Greek and Latin grammarians over 2,000 years ago. - -🎨 **Creative Usage:** Poets often deliberately violate POS conventions (like using nouns as verbs) to create artistic effects, challenging automatic taggers. - -### Technology Evolution - -🔧 **Implementation Evolution:** Early Viterbi implementations used lookup tables and required careful memory management; modern versions leverage GPU parallel processing. 
- -⚙️ **From Telecommunications to NLP:** The same mathematical principles that decode satellite communications now help computers understand human language structure. - -🚀 **Neural Integration:** Modern transformer models incorporate attention mechanisms that mirror Viterbi's dynamic programming approach, showing the algorithm's enduring influence. +10. **Language Diversity**: The number of POS tags varies widely across languages—some have just a few, while others, like Turkish, have many due to rich morphology.
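The initialization, recursion, termination, and backtracking steps in the `theory.md` hunk above can be checked against the "Book that flight" walkthrough with a short script. This is a minimal sketch, not the experiment's own implementation: the function name `viterbi`, the dictionary layout, and the uniform initial probability of 1/3 (the walkthrough rounds it to 0.33) are assumptions made for illustration. The back-pointer table is kept explicitly because the backtracking step relies on `Path`.

```python
from typing import Dict, List, Tuple

Table = List[Dict[str, float]]


def viterbi(
    words: List[str],
    tags: List[str],
    initial: Dict[str, float],
    transition: Dict[str, Dict[str, float]],
    emission: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float, Table]:
    """Return (best_path, best_prob, table) for a non-empty word list.

    table[j][s] is the maximum probability of any tag sequence that ends
    in tag s at word position j (0-based). Plain probabilities are used so
    the numbers can be compared with the walkthrough; real taggers usually
    work with log-probabilities to avoid underflow.
    """
    # Initialization: V[s][1] = pi[s] * b_s(w_1)
    table: Table = [{s: initial[s] * emission[s].get(words[0], 0.0) for s in tags}]
    backptr: List[Dict[str, str]] = [{}]

    # Recursion: V[s][j] = max_{s'} (V[s'][j-1] * a_{s',s}) * b_s(w_j)
    for j in range(1, len(words)):
        column: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in tags:
            prev = max(tags, key=lambda p: table[j - 1][p] * transition[p][s])
            column[s] = (
                table[j - 1][prev] * transition[prev][s] * emission[s].get(words[j], 0.0)
            )
            pointers[s] = prev  # remember which previous tag gave the maximum
        table.append(column)
        backptr.append(pointers)

    # Termination: pick the best tag in the final column
    last = max(tags, key=lambda s: table[-1][s])

    # Backtracking: follow the stored pointers from the last word to the first
    path = [last]
    for j in range(len(words) - 1, 0, -1):
        path.append(backptr[j][path[-1]])
    path.reverse()
    return path, table[-1][last], table


# Matrices copied from the walkthrough; the uniform start is an assumption.
TAGS = ["Noun", "Verb", "Det"]
INITIAL = {tag: 1 / 3 for tag in TAGS}
TRANSITION = {  # TRANSITION[previous_tag][next_tag]
    "Noun": {"Noun": 0.2, "Verb": 0.1, "Det": 0.6},
    "Verb": {"Noun": 0.5, "Verb": 0.2, "Det": 0.3},
    "Det": {"Noun": 0.8, "Verb": 0.2, "Det": 0.0},
}
EMISSION = {  # EMISSION[tag][word]
    "Noun": {"Book": 0.3, "that": 0.1, "flight": 0.8},
    "Verb": {"Book": 0.7, "that": 0.0, "flight": 0.1},
    "Det": {"Book": 0.0, "that": 0.9, "flight": 0.0},
}

path, prob, table = viterbi(
    ["Book", "that", "flight"], TAGS, INITIAL, TRANSITION, EMISSION
)
print(path)  # ['Verb', 'Det', 'Noun']
```

Run on these matrices, the sketch reproduces the path Verb → Det → Noun. It also gives V[Verb][3] ≈ 0.0013 (0.063 × 0.2 × 0.1) where the walkthrough lists 0.0001, so that table entry may be worth rechecking. The optimal path is the same either way.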