This repository implements and documents research on Topic-Modeled Curriculum Learning (TMCL), a novel framework that leverages topic modeling to estimate the intrinsic difficulty of training samples and guide curriculum learning for neural networks. Unlike traditional curriculum learning methods that rely on handcrafted heuristics or auxiliary models, TMCL provides a data-driven, unsupervised difficulty metric derived from the latent semantic structure of the data. The core principle is to train neural networks in an easy-to-hard order based on topic coherence, aiming to improve convergence speed, training stability, and final generalization performance.
Deep neural networks are predominantly trained on randomly shuffled datasets, which treats all samples as equally challenging and ignores the inherent semantic structure and varying complexity within the data. While Curriculum Learning (CL)—the idea of presenting samples in a meaningful order—has demonstrated benefits in accelerating learning and enhancing model robustness, most existing CL approaches suffer from significant limitations:
- Heuristic Dependency: They rely on manually defined difficulty measures (e.g., sentence length, image sharpness) that are domain-specific and often poor proxies for true learning complexity.
- Scalability Issues: Methods requiring pre-trained teacher models or reinforcement learning for curriculum generation are computationally expensive and difficult to scale.
- Lack of Generalization: Many strategies are tailored to specific tasks (e.g., NLP or Vision) and do not transfer well across domains.
This research addresses these gaps by proposing Topic-Modeled Curriculum Learning (TMCL). TMCL uses topic modeling to:
- Automatically discover the latent thematic structure within a training corpus.
- Quantify sample difficulty through statistical properties of its topic distribution.
- Construct an adaptive, semantically-informed curriculum schedule for neural network training.
The central hypothesis is that a sample's semantic "focus" or "purity," as captured by its topic distribution, correlates with its learnability. Samples with concentrated, low-entropy topic distributions are semantically coherent and "easier" to learn, while samples with high-entropy, dispersed distributions are semantically ambiguous or complex and thus "harder."
- Difficulty Metric: Can topic modeling provide a meaningful, scalable, and unsupervised measure of sample difficulty that correlates with actual training dynamics (e.g., loss convergence)?
- Training Efficacy: Does a curriculum structured by topic-modeled difficulty lead to faster convergence, lower final error, and improved generalization compared to standard random sampling?
- Comparative Performance: How does TMCL perform against established curriculum learning baselines (e.g., length-based, loss-based) and self-paced learning?
- Generalization & Robustness: Is the TMCL framework effective across diverse domains (NLP, Vision), tasks (classification, regression), and model architectures (CNNs, Transformers)?
The entropy of a sample's topic distribution is a valid proxy for its learning complexity.
Mathematical Formulation: For a sample $x_i$ with topic distribution $P(t \mid x_i)$ over $T$ topics, we define its difficulty as the Shannon entropy $D_{\text{entropy}}(x_i) = H(P) = -\sum_{t=1}^{T} P(t \mid x_i) \log P(t \mid x_i)$.
We hypothesize that $D_{\text{entropy}}(x_i)$ is positively correlated with the sample's loss during early training.
Neural networks trained with a TMCL schedule will exhibit superior training characteristics.
- Convergence Speed: Models will reach a target performance threshold in fewer epochs. Metric: epochs to reach $\alpha\%$ of final accuracy ($\alpha \in \{90, 95\}$).
- Generalization: Models will achieve lower final test error and a smaller generalization gap. Metric: final test accuracy and the train-test accuracy gap.
- Training Stability: Loss curves will be smoother, with lower variance between training runs. Metric: variance of training loss across epochs, $\sigma^2(\mathcal{L}_{\text{train}})$.
The benefits of TMCL are neither architecture- nor domain-specific and will generalize across:
- Domains: Text (AG News, IMDb) and Image (CIFAR-10/100, MNIST) datasets.
- Architectures: Convolutional Neural Networks (ResNet) and Transformer-based models (BERT).
Let a dataset be $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, and let a fitted topic model with $T$ topics assign each sample $x_i$ a topic distribution $P(t \mid x_i)$, $t = 1, \dots, T$.
Alternative Difficulty Metrics (Ablations):
- Max Probability (Purity): $D_{\text{max}}(x_i) = 1 - \max_t P(t \mid x_i)$. A lower maximum probability implies higher semantic ambiguity.
- Topic Coherence Deviation: For samples with ground-truth label $y_i$, compute the average topic distribution $\bar{P}_k$ for each class $k$. Difficulty is then $D_{\text{dev}}(x_i) = 1 - \cos\left(P(t \mid x_i), \bar{P}_{y_i}\right)$, where $\cos(\cdot, \cdot)$ denotes cosine similarity.
- Composite Score: $D_{\text{comp}}(x_i) = \lambda H(P) + (1 - \lambda) D_{\text{max}}(x_i)$ with $\lambda \in [0,1]$, where $H(P) = -\sum_{t=1}^{T} P(t \mid x_i) \log P(t \mid x_i)$ is the Shannon entropy.
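These difficulty scores admit a direct vectorized implementation. The sketch below assumes `P` is an $(n \times T)$ row-stochastic NumPy matrix of topic distributions; the function names are illustrative, not part of any released API.

```python
import numpy as np

def entropy_difficulty(P):
    """Shannon entropy of each row of a topic-distribution matrix P (n_samples x T)."""
    P = np.clip(P, 1e-12, 1.0)            # avoid log(0)
    return -(P * np.log(P)).sum(axis=1)

def max_prob_difficulty(P):
    """D_max = 1 - max_t P(t|x): a low peak probability means high ambiguity."""
    return 1.0 - P.max(axis=1)

def deviation_difficulty(P, y):
    """D_dev = 1 - cosine(P(t|x_i), mean topic distribution of class y_i)."""
    classes = np.unique(y)
    means = np.stack([P[y == k].mean(axis=0) for k in classes])
    index = {k: i for i, k in enumerate(classes)}
    ref = means[[index[k] for k in y]]    # per-sample class centroid
    cos = (P * ref).sum(axis=1) / (np.linalg.norm(P, axis=1) * np.linalg.norm(ref, axis=1))
    return 1.0 - cos

def composite_difficulty(P, lam=0.5):
    """D_comp = lam * H(P) + (1 - lam) * D_max."""
    return lam * entropy_difficulty(P) + (1.0 - lam) * max_prob_difficulty(P)
```

Note that $D_{\text{entropy}}$ ranges over $[0, \log T]$ while $D_{\text{max}}$ and $D_{\text{dev}}$ lie in $[0, 1]$, so the composite score implicitly weights entropy more heavily unless the terms are normalized first.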
The curriculum defines a difficulty threshold $\tau(e)$ for each epoch $e$; only samples with $D(x_i) \leq \tau(e)$ are eligible for training. We consider three schedules:
- Linear Schedule: $\tau_{\text{linear}}(e) = D_{\min} + \frac{e}{E} (D_{\max} - D_{\min})$, where $E$ is the total number of epochs and $D_{\min}, D_{\max}$ are the minimum and maximum difficulty scores in $\mathcal{D}$.
- Root Schedule (Slow Start): $\tau_{\text{root}}(e) = D_{\min} + \left(\frac{e}{E}\right)^{\gamma} (D_{\max} - D_{\min})$, with $\gamma < 1$.
- Exponential Schedule (Fast Start): $\tau_{\text{exp}}(e) = D_{\max} - (D_{\max} - D_{\min}) \cdot \beta^{e}$, with $\beta \in (0,1)$.
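The three schedules are a few lines each. This is a minimal sketch with illustrative names and defaults ($\gamma = 0.5$, $\beta = 0.95$); `eligible_fraction` reports how much of the dataset a given threshold admits.

```python
import numpy as np

def tau_linear(e, E, d_min, d_max):
    """Threshold grows linearly from D_min to D_max over E epochs."""
    return d_min + (e / E) * (d_max - d_min)

def tau_root(e, E, d_min, d_max, gamma=0.5):
    """Root schedule: threshold rises as (e/E)^gamma with gamma < 1."""
    return d_min + (e / E) ** gamma * (d_max - d_min)

def tau_exp(e, d_min, d_max, beta=0.95):
    """Exponential schedule: threshold approaches D_max as beta^e decays."""
    return d_max - (d_max - d_min) * beta ** e

def eligible_fraction(D, tau):
    """Proportion of samples whose difficulty is at most tau."""
    return float(np.mean(np.asarray(D) <= tau))
```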
The proportion of data available at epoch $e$ is $p(e) = \frac{1}{N}\left|\{x_i \in \mathcal{D} : D(x_i) \leq \tau(e)\}\right|$. The training objective at epoch $e$ becomes
$$\min_\theta \; \frac{1}{|\mathcal{D}_e|} \sum_{(x_i, y_i) \in \mathcal{D}_e} \mathcal{L}\big(f_\theta(x_i), y_i\big),$$
where $\mathcal{D}_e = \{(x_i, y_i) \in \mathcal{D} : D(x_i) \leq \tau(e)\}$ is the eligible subset, $f_\theta$ is the model with parameters $\theta$, and $\mathcal{L}$ is the task loss.
- Feature Extraction:
  - Text: Bag-of-words or TF-IDF vectors.
  - Images: Deep features from a pre-trained, frozen backbone (e.g., the ResNet-18 penultimate layer), used as embedding vectors.
- Model Fitting: Apply topic modeling (e.g., LDA or NMF) to the feature matrix to obtain the topic distribution $P(t \mid x_i)$ for all samples $x_i \in \mathcal{D}$.
- Difficulty Scoring: Compute the difficulty $D(x_i)$ for each sample. By default, we use topic entropy: $D(x_i) = H(P(t \mid x_i))$.
- Sort the entire dataset $\mathcal{D}$ by $D(x_i)$ in ascending order (easiest to hardest).
- Define a curriculum schedule function $\tau(e)$ that controls the difficulty threshold at epoch $e$.
- For each epoch $e$, construct the eligible subset $\mathcal{D}_e = \{x_i \in \mathcal{D} : D(x_i) \leq \tau(e)\}$.
- Train the target model (e.g., ResNet-18, BERT) using standard optimization (e.g., Adam), but sample mini-batches from $\mathcal{D}_e$ rather than from the full dataset, so the pool of training samples grows as $\tau(e)$ increases.
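As a concrete sketch of this pipeline for the text setting, the snippet below uses scikit-learn's `TfidfVectorizer` and `NMF` as a stand-in topic model and topic entropy as the default difficulty score. Function names are illustrative, and the actual experiments may use Gensim's LDA instead.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def score_corpus(texts, n_topics=5, seed=0):
    """Fit NMF on TF-IDF features and return per-sample topic-entropy difficulty."""
    X = TfidfVectorizer().fit_transform(texts)
    W = NMF(n_components=n_topics, init="nndsvda",
            random_state=seed, max_iter=400).fit_transform(X)
    # Normalize each row of the document-topic matrix to a distribution P(t | x_i).
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    P = np.clip(P, 1e-12, 1.0)
    return -(P * np.log(P)).sum(axis=1)   # topic entropy D(x_i)

def eligible_indices(D, tau):
    """Indices of samples whose difficulty is below the epoch threshold tau(e)."""
    return np.flatnonzero(np.asarray(D) <= tau)
```

With Gensim, `LdaModel.get_document_topics` would supply $P(t \mid x_i)$ directly instead of the normalized NMF factor.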
Comparison Regimes:
- RS (Random Sampling): Standard uniform shuffling over $\mathcal{D}$.
- Heuristic-CL: Curriculum based on task-specific heuristics (e.g., sentence length for NLP, image sharpness for Vision).
- SPL (Self-Paced Learning): Samples are weighted or selected based on their current loss $\mathcal{L}(f_\theta(x_i), y_i)$.
- TMCL (Proposed): Curriculum based on the unsupervised topic-modeled difficulty $D(x_i)$.
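For reference, the SPL baseline's selection rule can be sketched in its simplest hard-threshold form (following Kumar et al.): samples whose current loss falls below a pace parameter $\lambda$ are kept, and $\lambda$ grows each epoch. The growth rule and function names here are illustrative assumptions, not the exact baseline configuration.

```python
import numpy as np

def spl_select(losses, lam):
    """Binary self-paced weights: v_i = 1 iff the sample's current loss is below lam."""
    return (np.asarray(losses) < lam).astype(float)

def spl_update_pace(lam, growth=1.3):
    """Grow the pace parameter so harder samples are gradually admitted."""
    return lam * growth
```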
| Domain | Dataset | Model | Task | Topic Features |
|---|---|---|---|---|
| Vision | CIFAR-10/100 | ResNet-18/34 | Classification | ResNet-18 embeddings, clustered |
| Vision | MNIST/F-MNIST | Simple CNN | Classification | Raw pixels (flattened) or CNN embeddings |
| NLP | AG News | BERT-base | Classification | BERT [CLS] embeddings or BoW |
| NLP | IMDb | BERT-base | Sentiment Analysis | BERT [CLS] embeddings or BoW |
- Primary: Test Accuracy, Macro F1-Score.
- Curriculum Efficacy:
  - Convergence Speed: $\text{Epochs to Acc.} = \min \{ e \mid \text{Acc}(e) \geq \alpha \cdot \text{Acc}_{\text{final}} \}$
  - Area Under the Training Curve (AUTC): $\int_0^E \text{Acc}(e)\, de$ (higher is better)
  - Training Smoothness: $\frac{1}{E-1}\sum_{e=1}^{E-1} \left| \mathcal{L}_{e+1} - \mathcal{L}_e \right|$ (lower is better)
- Difficulty Metric Validation: Compute the Pearson correlation $r$ between $D(x_i)$ and the sample's loss after the first training epoch.
- Ablation on Difficulty Metric: Compare $D_{\text{entropy}}$, $D_{\text{max}}$, and $D_{\text{comp}}$ within the TMCL framework.
- Curriculum Schedule Ablation: Evaluate linear, root ($\gamma = 0.5, 0.7$), and exponential ($\beta = 0.95, 0.99$) schedules.
- Cross-Domain Benchmark: Compare training regimes:
  - RS (Random Sampling)
  - Heuristic-CL (e.g., sentence length, image sharpness)
  - SPL (Self-Paced Learning)
  - TMCL (Proposed)
- Sensitivity Analysis: Study the effect of the number of topics $T$ on final performance.
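The difficulty-metric validation step reduces to a single correlation computation; a sketch using `np.corrcoef`, with synthetic losses standing in for the recorded first-epoch losses:

```python
import numpy as np

def difficulty_loss_correlation(D, first_epoch_losses):
    """Pearson r between topic-modeled difficulty and per-sample loss after epoch 1."""
    return float(np.corrcoef(np.asarray(D), np.asarray(first_epoch_losses))[0, 1])
```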
- A Novel, Unsupervised Difficulty Metric: Proposes and validates topic distribution entropy as a general, data-driven measure of sample complexity.
- A Scalable CL Framework: Provides a practical TMCL pipeline that requires no handcrafted rules or auxiliary models, making CL accessible for new domains.
- Empirical Evidence: A comprehensive benchmark demonstrating the conditions under which TMCL provides benefits over established training paradigms.
- Theoretical Insight: Contributes to the understanding of how the semantic structure of data interacts with neural network optimization dynamics.
- Dynamic Topic Models: Incorporate online/dynamic topic models (e.g., Dynamic LDA) to allow the curriculum to adapt to the model's changing understanding during training.
- Multi-Modal TMCL: Extend the framework to multi-modal data (e.g., image-caption pairs) by modeling joint topic distributions across modalities.
- Large-Scale Language Model Training: Investigate the application of TMCL for pre-training or fine-tuning LLMs, where curriculum learning could reduce computational cost.
- Continual & Lifelong Learning: Explore TMCL for task ordering in continual learning scenarios, where "topic" spaces could represent tasks or skills.
- Curriculum Learning: Bengio et al., "Curriculum Learning" (ICML 2009); Soviany et al., "Curriculum Learning: A Survey" (2021).
- Self-Paced Learning: Kumar et al., "Self-Paced Learning for Latent Variable Models" (NIPS 2010); Hacohen & Weinshall, "On The Power of Curriculum Learning in Training Deep Networks" (ICML 2019).
- Automated Curriculum: Graves et al., "Automated Curriculum Learning for Neural Networks" (ICML 2017).
- Topic Modeling: Blei et al., "Latent Dirichlet Allocation" (JMLR 2003); Miao et al., "Neural Variational Inference for Text Processing" (ICML 2016).
- Data-Centric AI: Recent shifts towards understanding data order and quality; TMCL aligns with this paradigm.
- Deep Learning: PyTorch, PyTorch Lightning, HuggingFace Transformers.
- Topic Modeling: Gensim (LDA), Scikit-learn (NMF), OCTIS for neural topic models.
- Evaluation & Analysis: Scikit-learn, Matplotlib, Seaborn, Weights & Biases (for experiment tracking).
Reiyo
Research Focus: Deep Learning, Representation Learning, Optimization, Data-Centric AI.
Thesis Topic: Topic-Modeled Curriculum Learning for Efficient and Robust Neural Network Training.
This README serves as the living document for the research project. Theoretical formulations and experimental plans are subject to refinement based on ongoing results.