This repository contains the full implementation of my undergraduate thesis in Computer Engineering:
“Identification of Emerging Topics in News Through Unsupervised Methods.”
The project investigates how different unsupervised anomaly detection algorithms and a Retrieval-Augmented Generation (RAG) pipeline can detect novelty in large news datasets. Novelty refers to documents that introduce new semantic information compared to a known reference set.
All experiments were conducted on the TAP-DLND 1.0 benchmark corpus, which provides source–target news pairs for novelty classification.
Supervised models require labeled datasets, which are not always available in real-world journalism.
This project explores unsupervised alternatives, including:
- Local Outlier Factor (LOF)
- Isolation Forest
- Elliptic Envelope
- RAG + LLM classifier (DeepSeek / GPT-based)
These methods treat novelty detection as an outlier detection problem, following the definitions used in the thesis and literature.
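
As a minimal sketch of this framing, the three classical detectors can be fitted on a reference set and asked whether new points fall outside it. The example assumes scikit-learn and uses random vectors as hypothetical stand-ins for document representations; the parameters are illustrative and are not the values used in the thesis.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# Hypothetical stand-ins for document vectors: a dense "known" cluster
# and two candidates, one close to the cluster and one far from it.
reference = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
candidates = np.array([
    [0.1, -0.2, 0.0, 0.3, -0.1],  # resembles the reference set
    [8.0, 8.0, 8.0, 8.0, 8.0],    # far from everything seen so far
])

# Illustrative settings only (assumption, not the thesis configuration).
detectors = {
    "LOF": LocalOutlierFactor(n_neighbors=20, novelty=True),
    "IsolationForest": IsolationForest(random_state=0),
    "EllipticEnvelope": EllipticEnvelope(random_state=0),
}

predictions = {}
for name, detector in detectors.items():
    detector.fit(reference)
    # predict() returns +1 for inliers (non-novel) and -1 for outliers (novel).
    predictions[name] = detector.predict(candidates)
    print(name, predictions[name])
```

Note that `LocalOutlierFactor` needs `novelty=True` to expose `predict` for unseen data; without it, the estimator only scores the samples it was fitted on.
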
The workflow includes:
- Exploratory data analysis
- Preprocessing and vectorization (TF-IDF, PCA, embeddings)
- Running unsupervised models per event
- RAG-based semantic comparison
- Quantitative evaluation (precision, recall, F1-score, accuracy)
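
The vectorization and per-event detection steps above could be sketched as follows. This is an illustration under stated assumptions, not the code from the notebooks: it assumes scikit-learn, the function name `detect_novelty` and the toy documents are hypothetical, and the parameter values are chosen only so the example runs on a handful of sentences.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy event: source documents form the known reference set,
# target documents are the ones to be flagged as novel or non-novel.
source_docs = [
    "The earthquake struck the coastal city early on Monday.",
    "Rescue teams searched for survivors after the earthquake.",
    "Officials said the earthquake damaged hundreds of homes.",
    "The coastal city declared an emergency after the earthquake.",
    "Survivors of the earthquake were moved to temporary shelters.",
    "Aftershocks followed the earthquake throughout the night.",
]
target_docs = [
    "Rescue teams continued the search for earthquake survivors.",
    "The central bank unexpectedly raised interest rates.",
]


def detect_novelty(source_docs, target_docs, n_components=2, n_neighbors=3):
    """Return a boolean array with True where a target document is flagged novel."""
    # Fit TF-IDF on the reference set only, then reuse its vocabulary.
    vectorizer = TfidfVectorizer(stop_words="english")
    source_tfidf = vectorizer.fit_transform(source_docs).toarray()
    target_tfidf = vectorizer.transform(target_docs).toarray()

    # Reduce dimensionality; PCA needs dense input, hence toarray() above.
    pca = PCA(n_components=n_components, random_state=0)
    source_vecs = pca.fit_transform(source_tfidf)
    target_vecs = pca.transform(target_tfidf)

    # LOF in novelty mode: fit on known documents, predict on unseen ones.
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, novelty=True)
    lof.fit(source_vecs)
    return lof.predict(target_vecs) == -1


flags = detect_novelty(source_docs, target_docs)
for doc, is_novel in zip(target_docs, flags):
    print("novel" if is_novel else "non-novel", "|", doc)
```

Running the function once per event keeps each reference set small and topically coherent, which is the idea behind the "per event" step. On a corpus this small the flags are not meaningful; the sketch only shows how the stages connect.
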
| Directory / File | Description |
|---|---|
| 01 - Research and Results/ | PDF papers and experiment results. |
| analysis_imgs/ | Images generated during analysis and experiments. |
| database/ | Processed TAP-DLND dataset (source/target news). |
| Exploratory_Analysis/ | Visualizations and assets generated during the exploratory analysis stage. |
| models/ | Models produced during experimentation. |
| nltk_data/ | Local NLTK resources used for tokenization and preprocessing. |
| old_solutions/ | Deprecated scripts and previous versions of the workflow. |
| results/ | Consolidated results (metrics, CSVs, plots) from all classifiers. |
| scripts/ | Main code and notebooks used in the project pipeline. |
| ├── 01-exploratory_analysis.ipynb | Exploratory Data Analysis (EDA). |
| ├── 02-tf_idf_by_event_sistem_LOF.ipynb | LOF-based novelty detection. |
| ├── 03-tf_idf_by_event_sistem_Elliptic.ipynb | Elliptic Envelope classifier. |
| ├── 04-tf_idf_by_event_sistem_Isolation.ipynb | Isolation Forest classifier. |
| ├── 05-RAG_main_arquitecture.ipynb | Full Retrieval-Augmented Generation pipeline. |
| README.md | Project documentation. |
| .gitignore | Git ignore rules for the repository. |
📊 Main Results
Among all tested methods, LOF achieved the best performance, reaching:
- F1-score: 80.90%
- Accuracy: 85.80%
These results outperform several supervised baselines reported in the literature, which indicates that unsupervised novelty detection remains a competitive approach.
Full tables and figures are available in /01 - Research and Results/.
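
For reference, the metrics named in this README follow the standard confusion-matrix definitions. The sketch below computes them in plain Python, assuming "novel" is treated as the positive class (label 1). The labels are hypothetical and exist only to exercise the formulas; they are not thesis results.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy for binary labels (1 = novel)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels (not from the thesis): 3 TP, 2 FP, 1 FN, 4 TN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these labels, precision is 3/5, recall is 3/4, F1 is 2/3, and accuracy is 7/10.
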
📘 Reference
This project implements the methodology, architecture, and theoretical foundations described in:
Martín Ávila Buitrón, “Identificação de Temas Emergentes em Notícias Através de Métodos Não-Supervisionados”, UTFPR, 2025.