SHIFT: Seed-guided HIerarchical Framework for Topic Modelling

SHIFT is a framework for topic modelling in heterogeneous, multimodal domains, where traditional approaches fall short. Rather than discovering topics purely from data patterns, SHIFT supports seed-guided initialization, hierarchical topic evolution, and multimodal data within a shared semantic space.

Key Characteristics

SHIFT addresses four main challenges in topic modelling:

Seed-Guidance: Unlike unsupervised methods, SHIFT allows users to inject domain expertise via initial "seed topics" while maintaining the ability to discover novel sub-topics.
Hierarchies: Topics evolve iteratively, generating descendant topics through clustering operations. This captures parent-child relationships and fine-grained thematic distinctions.
Multimodality: Built on context-aware embeddings (e.g., CLIP), SHIFT operates on abstract vector representations. This makes it modality-agnostic: it can process text, images, and mixed-modal datasets.
Knowledge Base Integration: The framework can integrate external resources (like WordNet or domain ontologies) to compute component embeddings, grounding discovered topics in authoritative definitions rather than just statistical co-occurrence.

Architecture

SHIFT operates on three fundamental entities within a shared $d$-dimensional embedding space:

Units: Raw observational data (sentences, images).
Components: Semantically meaningful elements extracted from Units (n-grams, objects, entities).
Topics: Thematic clusters defined by sets of Components.
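
To make the three entities concrete, here is a minimal sketch of how they might be modelled as plain Python dataclasses. The class and field names are illustrative assumptions, not SHIFT's actual classes, and representing a topic by the mean of its component embeddings is likewise an assumption.

```python
from dataclasses import dataclass, field

Vector = list[float]  # a point in the shared d-dimensional space


@dataclass
class Component:
    """Semantically meaningful element extracted from a unit (e.g. an n-gram)."""
    label: str
    embedding: Vector


@dataclass
class Unit:
    """Raw observation (e.g. a sentence) together with its extracted components."""
    content: str
    embedding: Vector
    components: list[Component] = field(default_factory=list)


@dataclass
class Topic:
    """Thematic cluster defined by components; children form the hierarchy."""
    name: str
    components: list[Component] = field(default_factory=list)
    children: list["Topic"] = field(default_factory=list)

    def embedding(self) -> Vector:
        # Assumption: a topic is represented by the mean of its components.
        if not self.components:
            raise ValueError("a topic needs at least one component")
        n = len(self.components)
        return [sum(dim) / n for dim in zip(*(c.embedding for c in self.components))]


# Toy 2-dimensional example (real embeddings have hundreds of dimensions).
gun = Component("gun", [1.0, 0.0])
rifle = Component("rifle", [0.8, 0.2])
topic = Topic("firearms", [gun, rifle])
unit = Unit("He cleaned the rifle.", [0.9, 0.1], [rifle])
topic_vector = topic.embedding()
```

Because all three entities carry vectors of the same dimensionality, a unit can be compared directly with a topic or a component regardless of the modality it came from.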

The framework follows an iterative Classify-Enrich-Evolve cycle:

Classify: Units are assigned to Topics based on semantic similarity.
Enrich: Topics update their definitions by absorbing representative Components from the newly classified Units.
Evolve: Topics generate sub-topics (children) by clustering their defining Components.
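
The cycle above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not SHIFT's implementation: topics are represented by the mean of their component vectors, each unit goes to its single most similar topic, and the clustering step in `evolve` is a simple two-seed split standing in for whatever clustering the framework actually uses. All function names and the toy data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def classify(units, topics, components, alpha):
    """Assign each unit to its most similar topic if the similarity is >= alpha."""
    centroids = {t: mean([components[c] for c in labels]) for t, labels in topics.items()}
    assigned = {t: [] for t in topics}
    for unit in units:
        scores = {t: cosine(unit["embedding"], c) for t, c in centroids.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= alpha:
            assigned[best].append(unit)
    return assigned


def enrich(topics, assigned, components, beta):
    """Add to each topic the components of its units that score >= beta."""
    for t, units in assigned.items():
        centroid = mean([components[c] for c in topics[t]])
        for unit in units:
            for label in unit["components"]:
                if cosine(components[label], centroid) >= beta:
                    topics[t].add(label)


def evolve(labels, components):
    """Split a topic into two children around its two least similar components."""
    labels = sorted(labels)
    if len(labels) < 2:
        return [set(labels)]
    pairs = [(a, b) for i, a in enumerate(labels) for b in labels[i + 1:]]
    seed_a, seed_b = min(pairs, key=lambda p: cosine(components[p[0]], components[p[1]]))
    children = {seed_a: set(), seed_b: set()}
    for label in labels:
        nearest = max((seed_a, seed_b), key=lambda s: cosine(components[label], components[s]))
        children[nearest].add(label)
    return list(children.values())


# Toy 2-dimensional data: component embeddings, seed topics, and units.
components = {
    "gun": [1.0, 0.0], "rifle": [0.95, 0.1],
    "pistol": [0.8, 0.5], "holster": [0.75, 0.55],
    "computer": [0.0, 1.0], "keyboard": [0.1, 0.95],
}
topics = {"weapons": {"gun"}, "computers": {"computer"}}
units = [
    {"embedding": [0.9, 0.2], "components": ["rifle", "pistol"]},
    {"embedding": [0.05, 1.0], "components": ["keyboard"]},
    {"embedding": [0.8, 0.5], "components": ["holster", "keyboard"]},
    {"embedding": [-1.0, -1.0], "components": []},  # matches no topic
]

assigned = classify(units, topics, components, alpha=0.3)
enrich(topics, assigned, components, beta=0.3)
children = evolve(topics["weapons"], components)
```

In this toy run the last unit stays unassigned because it falls below `alpha` for every topic, and "keyboard" is not absorbed into "weapons" because it falls below `beta` for that topic, which is the role the two thresholds play in the usage examples.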

Installation

To install SHIFT, we recommend using Conda to manage the environment and dependencies.

# Create and activate the environment
conda create -n shift python=3.10
conda activate shift

# Configure channels and install dependencies
conda config --append channels pytorch
conda install --yes --file requirements.txt

Usage

1. Basic Textual Topic Modelling

This example demonstrates how to process a textual dataset (20 Newsgroups), provide seed sentences to guide the topics, and run the training cycle.

import torch
from sklearn.datasets import fetch_20newsgroups
from shift.data_handling.data_management import DataManager
from shift.data_handling.data_processing import DataProcessor
from shift.core.topic_modeling import SHIFT

# Setup device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
TRANSFORMER = "sentence-transformers/all-MiniLM-L6-v2"

# 1. Load Data
documents = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# 2. Initialize Managers
manager = DataManager(embedding_size=384)
processor = DataProcessor(
    data_manager=manager, 
    transformer=TRANSFORMER, 
    batch_size=32, 
    device=DEVICE
)

# 3. Process Raw Data
processor.process_data(data=documents)

# 4. Define Seed Topics (Sentences describing the desired topics)
topics = [
    ["Atheism is an absence of belief in the existence of deities."],
    ["A personal computer, often referred to as a PC, is a computer designed for individual use."],
    ["A gun is a device designed to propel a projectile using pressure or explosive force."],
]
processor.process_topic_data(topics)

# 5. Initialize and Fit SHIFT
# alpha: classification threshold, beta: enrichment threshold
model = SHIFT(alpha=0.3, beta=0.3)
model.fit(manager, n_cycles=9)

2. Knowledge Base Integration

SHIFT can leverage external knowledge bases (KBs) such as WordNet to define components. Instead of averaging token embeddings to represent a component, SHIFT looks up the component's definition in the KB and encodes that definition.

# Imports and setup as above

manager_kb = DataManager(embedding_size=384)
processor_kb = DataProcessor(
    data_manager=manager_kb, transformer=TRANSFORMER, batch_size=32, device=DEVICE
)

# Process text using WordNet for component extraction and embedding
processor_kb.process_textual_data(
    textual_data=documents,
    document_cleaner="space",
    sentence_splitter=f"nltk;{TRANSFORMER};None;10;{DEVICE}",
    encoder=f"base;{TRANSFORMER};{DEVICE}",
    component_extractor="wordnet;eng;3", 
    embedder=f"knowledge_base;{TRANSFORMER};{DEVICE}", 
    merge_components=False,
)

# Provide simple labels as seeds; the system will look up their definitions
topics_kb = [
    "atheism",
    "personal computer",
    "gun",
]

processor_kb.process_topic_data(
    topics_kb,
    topic_processor=f"wordnet;{TRANSFORMER};eng;{DEVICE}", 
    merge_components=False,
)

# Proceed with SHIFT initialization and fitting.
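
The definition-lookup idea described in this section can be illustrated without any external dependency. In the sketch below, `TOY_KB` is a hypothetical stand-in for WordNet, `encode` is a toy bag-of-words encoder standing in for the sentence transformer, and falling back to the bare label when the KB has no entry is an assumption. None of these names come from SHIFT.

```python
import math
from collections import Counter

# Hypothetical stand-in for a knowledge base such as WordNet: label -> definition.
TOY_KB = {
    "gun": "a weapon that discharges a projectile",
    "atheism": "a lack of belief in the existence of gods",
}


def encode(text, vocabulary):
    """Toy bag-of-words encoder standing in for a sentence transformer."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def component_embedding(label, kb, vocabulary):
    """Encode the KB definition of a label; fall back to the label itself (assumption)."""
    return encode(kb.get(label, label), vocabulary)


unit = "the weapon fired a projectile"
vocabulary = sorted(set(" ".join([unit, "gun", *TOY_KB.values()]).lower().split()))
unit_vector = encode(unit, vocabulary)

# Similarity of the unit to the bare label vs. to the KB definition of the label.
label_only = cosine(unit_vector, encode("gun", vocabulary))
with_definition = cosine(unit_vector, component_embedding("gun", TOY_KB, vocabulary))
```

The sentence never mentions the word "gun", so the bare label scores zero with this toy encoder, while the definition shares several words with it. A real transformer encoder is far less brittle than bag-of-words, but the sketch shows why encoding a definition gives a component more semantic surface than a single label.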

3. Multimodal Topic Modelling (Text + Images)

SHIFT can handle mixed datasets. In this example, we combine the 20 Newsgroups text documents with images from the Pascal VOC 2012 dataset (loaded through torchvision's VOCDetection). We use CLIP for the shared embedding space and YOLO for visual object extraction.

import torch
from sklearn.datasets import fetch_20newsgroups
from torchvision.datasets import VOCDetection
from shift.data_handling.data_management import DataManager
from shift.data_handling.data_processing import DataProcessor
from shift.core.topic_modeling import SHIFT

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DATA_PATH = "path/to/data"

# Use a multimodal encoder (CLIP)
TRANSFORMER = "openai/clip-vit-base-patch32"
EMBEDDING_SIZE = 512 

# 1. Prepare Mixed Data
documents = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

dataset = VOCDetection(root=DATA_PATH, year="2012", image_set="train", download=True)
images = [image for image, _ in dataset]  # keep the image, drop the annotation

multimodal_data = documents + images

# 2. Configure Processor
manager = DataManager(embedding_size=EMBEDDING_SIZE)
processor = DataProcessor(
    data_manager=manager, transformer=TRANSFORMER, batch_size=64, device=DEVICE
)

# 3. Process Data (TextRank for text, YOLO for images)
processor.process_data(
    data=multimodal_data,
    encoder=f"clip;{TRANSFORMER};{DEVICE}",
    embedder=f"clip;{TRANSFORMER};{DEVICE}",
    document_cleaner="space",
    sentence_splitter=f"nltk;{TRANSFORMER};None;10;{DEVICE}",
    text_component_extractor=f"textrank;en_core_web_lg;{TRANSFORMER};NOUN,PROPN,ADJ;{DEVICE}",
    visual_component_extractor=f"yoloe;./models/yoloe-v8l-seg-pf.pt;{TRANSFORMER};0.1;{DEVICE}",
    merge_components=True,
    merge_threshold=0.9,
)

# 4. Define Multimodal Seeds
# Seeds can be text strings or actual image objects
topic_image_person = dataset[0][0]
topic_image_car = dataset[7][0]

topics = [
    ["atheism", "religion", "god"],  # Text-only seed
    [topic_image_person],            # Image-only seed
    [topic_image_car, "automobile"], # Mixed seed
]

processor.process_topic_data(
    topics, 
    topic_processor=f"contextual;{TRANSFORMER};multimodal;{DEVICE}"
)

# 5. Run SHIFT
model = SHIFT(alpha=0.3, beta=0.3)
model.fit(manager, n_cycles=9)
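
The example above passes `merge_components=True` with `merge_threshold=0.9`, but this README does not spell out what merging does. One plausible reading, offered here purely as an assumption, is that components whose embeddings are nearly identical are collapsed into one. The sketch below shows a greedy version of that idea; the function is hypothetical and is not SHIFT's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_similar(components, threshold):
    """Greedily fold each component into the first kept one with similarity >= threshold."""
    groups = {}  # representative label -> labels merged into it
    for label, vector in components.items():
        for representative in groups:
            if cosine(vector, components[representative]) >= threshold:
                groups[representative].append(label)
                break
        else:
            # No existing representative is close enough: keep as a new component.
            groups[label] = [label]
    return groups


# Toy 2-dimensional embeddings; in a CLIP space a detected object and a
# text n-gram describing it could end up this close to each other.
components = {
    "car": [1.0, 0.0],
    "automobile": [0.99, 0.05],
    "person": [0.0, 1.0],
}
groups = merge_similar(components, threshold=0.9)
```

Under this reading, a high threshold such as 0.9 merges only near-duplicates, while a lower one would collapse loosely related components and blur the distinctions that sub-topics rely on.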

Citation

If you use SHIFT in your research, please cite the following thesis/paper:

@phdthesis{shift_framework,
    title    = {Adaptive Frameworks for Knowledge Extraction in Heterogeneous Data Environments},
    author   = {Sergio Picascia},
    school   = {Università degli Studi di Milano},
    year     = {2025}
}
