
mab0205/Novelty-Detection-Data-Science


Emerging Topic Detection in News Using Unsupervised Methods

This repository contains the full implementation of my undergraduate thesis in Computer Engineering:
“Identification of Emerging Topics in News Through Unsupervised Methods.”

The project investigates how unsupervised anomaly detection algorithms and a Retrieval-Augmented Generation (RAG) pipeline can detect novelty in large news datasets. A document is considered novel when it introduces semantic information that is not present in a known reference set.

All experiments were conducted using the benchmark TAP-DLND 1.0 corpus, which provides source–target news pairs for novelty classification.
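
To make the RAG idea concrete, here is a minimal sketch of the retrieval and prompt-building steps for one source–target comparison. It is an illustration only, not the code in the notebooks: the function names (`bow`, `cosine`, `retrieve`, `build_prompt`), the prompt wording, and the toy articles are all assumptions, plain token counts stand in for whatever embedding the real pipeline uses, and no LLM is actually called.

```python
import math
import re
from collections import Counter


def bow(text):
    """Bag-of-words token counts (a crude stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(target, sources, k=2):
    """Return the k source articles most similar to the target."""
    target_vec = bow(target)
    ranked = sorted(sources, key=lambda s: cosine(target_vec, bow(s)), reverse=True)
    return ranked[:k]


def build_prompt(target, retrieved):
    """Assemble a (hypothetical) prompt asking an LLM for a novelty verdict."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return (
        "Known source articles:\n" + context + "\n\n"
        "Target article:\n" + target + "\n\n"
        "Does the target add new information beyond the sources? "
        "Answer NOVEL or NON-NOVEL."
    )


# Toy data, invented for this example.
sources = [
    "Flood waters rise in the river valley after heavy rain.",
    "Rescue teams evacuate residents from the flooded valley.",
    "Central bank keeps interest rates unchanged.",
]
target = "Heavy rain causes flood in the valley and residents are evacuated."

top = retrieve(target, sources, k=2)
prompt = build_prompt(target, top)
print(prompt)
```

The resulting prompt would then be sent to the LLM classifier, and its answer mapped to a novel / non-novel label.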


🔍 Project Overview

Supervised models require labeled datasets, which are not always available in real-world journalism.
This project explores unsupervised alternatives:

  • Local Outlier Factor (LOF)
  • Isolation Forest
  • Elliptic Envelope
  • RAG + LLM classifier (DeepSeek / GPT-based)

These methods treat novelty detection as an outlier detection problem, following the definitions used in the thesis and literature.
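
As a rough illustration of this framing, the sketch below fits each of the three scikit-learn detectors on a reference set and flags targets that fall outside it. The 2-D Gaussian points are synthetic stand-ins for vectorized news documents, and the hyperparameters and the `flag_novel` helper are assumptions made for this example, not the settings used in the thesis.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Synthetic stand-in for the vectorized source (reference) documents.
reference = rng.normal(loc=0.0, scale=0.5, size=(200, 2))

# Two synthetic targets: one inside the reference cluster, one far from it.
targets = np.array([[0.1, 0.0], [5.0, 5.0]])

detectors = {
    # novelty=True lets LOF score unseen data after fitting on the reference set.
    "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
    "IsolationForest": IsolationForest(random_state=0),
    "EllipticEnvelope": EllipticEnvelope(random_state=0),
}


def flag_novel(detector, reference, targets):
    """Fit on the reference set; True marks a target predicted as an outlier."""
    detector.fit(reference)
    return detector.predict(targets) == -1  # scikit-learn uses -1 for outliers


results = {name: flag_novel(det, reference, targets) for name, det in detectors.items()}
for name, flags in results.items():
    print(name, flags.tolist())
```

In this setup an outlier prediction is read as "novel" and an inlier prediction as "non-novel".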

The workflow includes:

  1. Exploratory data analysis
  2. Preprocessing and vectorization (TF-IDF, PCA, embeddings)
  3. Running unsupervised models per event
  4. RAG-based semantic comparison
  5. Quantitative evaluation (precision, recall, F1-score, accuracy)
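
Steps 2, 3, and 5 can be sketched for a single event roughly as follows. This is a simplified outline under stated assumptions, not the notebooks' code: `detect_novel` is a hypothetical helper, the documents and gold labels are invented toy data, and the component and neighbor counts are arbitrary.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.neighbors import LocalOutlierFactor


def detect_novel(source_docs, target_docs, n_components=2, n_neighbors=3):
    """Return 1 (novel) or 0 (non-novel) for each target document of one event."""
    # Step 2: TF-IDF fitted on the sources only, then reduced with PCA.
    vectorizer = TfidfVectorizer(stop_words="english")
    source_vecs = vectorizer.fit_transform(source_docs).toarray()
    target_vecs = vectorizer.transform(target_docs).toarray()

    pca = PCA(n_components=n_components, random_state=0)
    source_low = pca.fit_transform(source_vecs)
    target_low = pca.transform(target_vecs)

    # Step 3: fit the detector on the sources, then score the unseen targets.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    lof.fit(source_low)
    return [int(label == -1) for label in lof.predict(target_low)]


# Toy event, invented for this example.
source_docs = [
    "Earthquake strikes coastal city, buildings damaged.",
    "Rescue workers search damaged buildings after the earthquake.",
    "Aftershocks follow the earthquake in the coastal city.",
    "Officials report damage to roads after the earthquake.",
    "Coastal city residents shelter after strong earthquake.",
    "Earthquake damage closes schools in the coastal city.",
]
target_docs = [
    "Strong earthquake damages buildings in coastal city.",
    "Tsunami warning issued and nuclear plant shut down.",
]
y_true = [0, 1]  # hypothetical gold labels: 1 = novel

y_pred = detect_novel(source_docs, target_docs)

# Step 5: quantitative evaluation against the gold labels.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
accuracy = accuracy_score(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

With only a handful of documents the predictions are not meaningful; the point is the order of operations, in particular that the vectorizer, PCA, and detector are all fitted on the source documents alone.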

📂 Repository Structure


Directory / File                               Description
01 - Research and Results/                     PDF papers and experiment results.
analysis_imgs/                                 Images generated during analysis and experiments.
database/                                      Processed TAP-DLND dataset (source/target news).
Exploratory_Analysis/                          Visualizations and assets generated during the exploratory analysis stage.
models/                                        Models produced during experimentation.
nltk_data/                                     Local NLTK resources used for tokenization and preprocessing.
old_solutions/                                 Deprecated scripts and previous versions of the workflow.
results/                                       Consolidated results (metrics, CSVs, plots) from all classifiers.
scripts/                                       Main code and notebooks used in the project pipeline.
├── 01-exploratory_analysis.ipynb              Exploratory Data Analysis (EDA).
├── 02-tf_idf_by_event_sistem_LOF.ipynb        LOF-based novelty detection.
├── 03-tf_idf_by_event_sistem_Elliptic.ipynb   Elliptic Envelope classifier.
├── 04-tf_idf_by_event_sistem_Isolation.ipynb  Isolation Forest classifier.
└── 05-RAG_main_arquitecture.ipynb             Full Retrieval-Augmented Generation pipeline.
README.md                                      Project documentation.
.gitignore                                     Git ignore rules for the repository.

📊 Main Results

Among all tested methods, LOF achieved the best performance, reaching:

  • F1-score: 80.90%
  • Accuracy: 85.80%

These results outperform several supervised baselines reported in the literature, demonstrating that unsupervised novelty detection remains a competitive approach.

Full tables and figures are available in /01 - Research and Results/.

📘 Reference

This project implements the methodology, architecture, and theoretical foundations described in:

Martín Ávila Buitrón, “Identificação de Temas Emergentes em Notícias Através de Métodos Não-Supervisionados”, UTFPR, 2025.
