SHIFT is a novel framework designed to overcome the limitations of traditional topic modelling approaches in heterogeneous, multimodal domains. SHIFT moves beyond discovering topics purely from data patterns by allowing seed-guided initialization, enabling hierarchical topic evolution, and supporting multimodal data within a shared semantic space.
SHIFT addresses four main challenges in topic modelling:
- Seed-Guidance: Unlike unsupervised methods, SHIFT allows users to inject domain expertise via initial "seed topics" while maintaining the ability to discover novel sub-topics.
- Hierarchies: Topics evolve iteratively, generating descendant topics through clustering operations. This captures parent-child relationships and fine-grained thematic distinctions.
- Multimodality: Built on context-aware embeddings (e.g., CLIP), SHIFT operates on abstract vector representations. This makes it modality-agnostic, capable of processing text, images, and mixed-modal datasets seamlessly.
- Knowledge Base Integration: The framework can integrate external resources (like WordNet or domain ontologies) to compute component embeddings, grounding discovered topics in authoritative definitions rather than just statistical co-occurrence.
SHIFT operates on three fundamental entities within a shared semantic space:
- Units: Raw observational data (sentences, images).
- Components: Semantically meaningful elements extracted from Units (n-grams, objects, entities).
- Topics: Thematic clusters defined by sets of Components.
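
To make the relationship between the three entities concrete, here is a minimal sketch using plain dataclasses. The class names, fields, and the `spawn_child` helper are hypothetical and chosen only for illustration; they are not SHIFT's actual classes, which live behind `DataManager`.

```python
from dataclasses import dataclass, field
from typing import List, Optional

Vector = List[float]


@dataclass
class Component:
    """A semantically meaningful element (n-gram, object, entity)."""
    label: str
    embedding: Vector


@dataclass
class Unit:
    """A raw observation (a sentence, or a reference to an image)."""
    content: str
    embedding: Vector
    components: List[Component] = field(default_factory=list)


@dataclass
class Topic:
    """A thematic cluster defined by a set of components."""
    name: str
    components: List[Component] = field(default_factory=list)
    # Hypothetical parent/children links to represent the topic hierarchy.
    parent: Optional["Topic"] = field(default=None, repr=False, compare=False)
    children: List["Topic"] = field(default_factory=list)

    def spawn_child(self, name: str, components: List[Component]) -> "Topic":
        """Create a sub-topic and register it under this topic."""
        child = Topic(name=name, components=list(components), parent=self)
        self.children.append(child)
        return child


root = Topic("computing", [Component("personal computer", [1.0, 0.0])])
child = root.spawn_child("hardware", [Component("motherboard", [0.9, 0.1])])
```

Because all three entities carry embeddings of the same dimensionality, a Unit, a Component, and a Topic can be compared directly, regardless of the modality they came from.
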
The framework follows an iterative Classify-Enrich-Evolve cycle:
- Classify: Units are assigned to Topics based on semantic similarity.
- Enrich: Topics update their definitions by absorbing representative Components from the newly classified Units.
- Evolve: Topics generate sub-topics (children) by clustering their defining Components.
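
The cycle above can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not SHIFT's implementation: cosine similarity against a topic centroid, the way `alpha` and `beta` are applied, and the greedy grouping used in `evolve` are all stand-ins, and every function name here is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(units, topics, alpha):
    """Assign each unit to its most similar topic, if similarity >= alpha."""
    assigned = {name: [] for name in topics}
    for unit in units:
        scores = {name: cosine(unit["embedding"], centroid(comps))
                  for name, comps in topics.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= alpha:
            assigned[best].append(unit)
    return assigned


def enrich(topics, assigned, beta):
    """Add unit components whose similarity to the topic is >= beta."""
    for name, units in assigned.items():
        center = centroid(topics[name])
        for unit in units:
            for comp in unit["components"]:
                if comp not in topics[name] and cosine(comp, center) >= beta:
                    topics[name].append(comp)


def evolve(components, threshold):
    """Greedily group components; each group would define one child topic."""
    groups = []
    for comp in components:
        for group in groups:
            if cosine(comp, centroid(group)) >= threshold:
                group.append(comp)
                break
        else:
            groups.append([comp])
    return groups


# Toy 2-D data: each topic is a list of component vectors.
topics = {"computing": [[1.0, 0.0]], "weapons": [[0.0, 1.0]]}
units = [
    {"embedding": [0.9, 0.1], "components": [[0.8, 0.2], [0.95, 0.05]]},
    {"embedding": [0.1, 0.9], "components": [[0.2, 0.8]]},
    {"embedding": [-1.0, -1.0], "components": [[-1.0, -1.0]]},  # below alpha
]

assigned = classify(units, topics, alpha=0.3)
enrich(topics, assigned, beta=0.3)
children = evolve(topics["computing"], threshold=0.99)
```

In this toy run the third unit falls below `alpha` for every topic and stays unassigned, while the "computing" topic absorbs two new components and then splits into two candidate sub-topics.
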
To install SHIFT, we recommend using Conda to manage the environment and dependencies.
# Create and activate the environment
conda create -n shift python=3.10
conda activate shift
# Configure channels and install dependencies
conda config --append channels pytorch
conda install --yes --file requirements.txt

This example demonstrates how to process a textual dataset (20 Newsgroups), provide seed sentences to guide the topics, and run the training cycle.
import torch
from sklearn.datasets import fetch_20newsgroups
from shift.data_handling.data_management import DataManager
from shift.data_handling.data_processing import DataProcessor
from shift.core.topic_modeling import SHIFT
# Setup device
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
TRANSFORMER = "sentence-transformers/all-MiniLM-L6-v2"
# 1. Load Data
documents = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
# 2. Initialize Managers
manager = DataManager(embedding_size=384)
processor = DataProcessor(
    data_manager=manager,
    transformer=TRANSFORMER,
    batch_size=32,
    device=DEVICE,
)
# 3. Process Raw Data
processor.process_data(data=documents)
# 4. Define Seed Topics (Sentences describing the desired topics)
topics = [
    ["Atheism is an absence of belief in the existence of deities."],
    ["A personal computer, often referred to as a PC, is a computer designed for individual use."],
    ["A gun is a device designed to propel a projectile using pressure or explosive force."],
]
processor.process_topic_data(topics)
# 5. Initialize and Fit SHIFT
# alpha: classification threshold, beta: enrichment threshold
model = SHIFT(alpha=0.3, beta=0.3)
model.fit(manager, n_cycles=9)

SHIFT can leverage external knowledge bases (KBs) such as WordNet to define components. Instead of averaging token embeddings to obtain a component embedding, SHIFT looks up the component's definition in the KB and encodes that definition.
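
The lookup-then-encode idea can be illustrated with a toy example. Everything here is an assumption made for the sake of a self-contained sketch: `TOY_KB` stands in for WordNet, `toy_encode` is a hashed bag-of-words stand-in for a real sentence encoder, and falling back to the label when no definition exists is a guess rather than documented SHIFT behaviour.

```python
import math
import zlib

# Toy knowledge base standing in for WordNet or a domain ontology.
TOY_KB = {
    "atheism": "Atheism is an absence of belief in the existence of deities.",
    "gun": "A gun is a device designed to propel a projectile "
           "using pressure or explosive force.",
}


def toy_encode(text, dim=16):
    """Deterministic stand-in encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def embed_component(label, kb, encode):
    """Encode the KB definition if one exists, otherwise the label itself."""
    definition = kb.get(label)
    return encode(definition if definition is not None else label)


gun_vec = embed_component("gun", TOY_KB, toy_encode)      # uses the definition
rifle_vec = embed_component("rifle", TOY_KB, toy_encode)  # no entry: uses label
```

With a real encoder, the embedding of "gun" would then reflect the full definition rather than the single token.
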
# Imports and setup as above
manager_kb = DataManager(embedding_size=384)
processor_kb = DataProcessor(
    data_manager=manager_kb, transformer=TRANSFORMER, batch_size=32, device=DEVICE
)
# Process text using WordNet for component extraction and embedding
processor_kb.process_textual_data(
    textual_data=documents,
    document_cleaner="space",
    sentence_splitter=f"nltk;{TRANSFORMER};None;10;{DEVICE}",
    encoder=f"base;{TRANSFORMER};{DEVICE}",
    component_extractor="wordnet;eng;3",
    embedder=f"knowledge_base;{TRANSFORMER};{DEVICE}",
    merge_components=False,
)
# Provide simple labels as seeds; the system will look up their definitions
topics_kb = [
    "atheism",
    "personal computer",
    "gun",
]
processor_kb.process_topic_data(
    topics_kb,
    topic_processor=f"wordnet;{TRANSFORMER};eng;{DEVICE}",
    merge_components=False,
)
# Proceed with SHIFT initialization and fitting.

SHIFT can also handle mixed datasets. In this example, we combine text documents with images from the VOCDetection dataset, using CLIP for the shared embedding space and YOLO for visual object extraction.
import torch
from sklearn.datasets import fetch_20newsgroups
from torchvision.datasets import VOCDetection
from shift.data_handling.data_management import DataManager
from shift.data_handling.data_processing import DataProcessor
from shift.core.topic_modeling import SHIFT
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DATA_PATH = "path/to/data"
# Use a multimodal encoder (CLIP)
TRANSFORMER = "openai/clip-vit-base-patch32"
EMBEDDING_SIZE = 512
# 1. Prepare Mixed Data
documents = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
dataset = VOCDetection(root=DATA_PATH, year="2012", image_set="train", download=True)
images = [image for image, _ in dataset]
multimodal_data = documents + images
# 2. Configure Processor
manager = DataManager(embedding_size=EMBEDDING_SIZE)
processor = DataProcessor(
    data_manager=manager, transformer=TRANSFORMER, batch_size=64, device=DEVICE
)
# 3. Process Data (TextRank for text, YOLO for images)
processor.process_data(
    data=multimodal_data,
    encoder=f"clip;{TRANSFORMER};{DEVICE}",
    embedder=f"clip;{TRANSFORMER};{DEVICE}",
    document_cleaner="space",
    sentence_splitter=f"nltk;{TRANSFORMER};None;10;{DEVICE}",
    text_component_extractor=f"textrank;en_core_web_lg;{TRANSFORMER};NOUN,PROPN,ADJ;{DEVICE}",
    visual_component_extractor=f"yoloe;./models/yoloe-v8l-seg-pf.pt;{TRANSFORMER};0.1;{DEVICE}",
    merge_components=True,
    merge_threshold=0.9,
)
# 4. Define Multimodal Seeds
# Seeds can be text strings or actual image objects
topic_image_person = dataset[0][0]
topic_image_car = dataset[7][0]
topics = [
    ["atheism", "religion", "god"],   # Text-only seed
    [topic_image_person],             # Image-only seed
    [topic_image_car, "automobile"],  # Mixed seed
]
processor.process_topic_data(
    topics,
    topic_processor=f"contextual;{TRANSFORMER};multimodal;{DEVICE}",
)
# 5. Run SHIFT
model = SHIFT(alpha=0.3, beta=0.3)
model.fit(manager, n_cycles=9)

If you use SHIFT in your research, please cite the following thesis:
@phdthesis{shift_framework,
  title  = {Adaptive Frameworks for Knowledge Extraction in Heterogeneous Data Environments},
  author = {Sergio Picascia},
  school = {Università degli Studi di Milano},
  year   = {2025}
}