
ATRS

Official implementation of:

Lim, H., Li, X., Park, S., Li, Q., & Kim, J. (2026). Reducing contextual noise in review-based recommendation via aspect term extraction and attention modeling. Information Sciences, 735, 123078.

Overview

This repository is the official implementation of ATRS (Aspect Term-aware Recommender System), published in Information Sciences (2026).

Most review-based recommendation models process entire review bodies indiscriminately, allowing aspect-relevant signal to be diluted by surrounding context. ATRS addresses this by routing review text through a dedicated Aspect Term Extraction (ATE) stage that filters out non-aspect content before downstream encoding.

The retained aspect terms are encoded with a 1D-CNN over Word2Vec embeddings, fused with user/item ID embeddings, and passed through a self-attention block to form aspect-aware user and item representations. These are concatenated and forwarded to an MLP that predicts a continuous rating score as a regression target. Quantitative comparisons against representative recommendation baselines on Amazon and Yelp datasets are reported in Experimental Results.

Repository Structure

├── data/
│   ├── raw/                        # Source datasets — place {fname}.{raw_ext} here
│   ├── processed/                  # Pipeline parquet caches (preprocessed / aspects)
│   ├── ate_output/                 # PyABSA workspace + extraction JSON
│   │   └── .pyabsa/                # Contained pyabsa CWD: checkpoints/, checkpoints.json, result JSON
│   └── ATRS Architecture.png
│
├── model/
│   ├── atrs.py                     # ATRS architecture, trainer, and predictor
│   └── save/                       # Best checkpoint per dataset (best.pth)
│
├── src/
│   ├── config.yaml                 # Single source of truth for all hyperparameters
│   ├── data_processing.py          # DataProcessor pipeline + RecommenderDataset + DataLoader factory
│   ├── aspect_extraction.py        # ATExtractor — PyABSA wrapper for aspect term extraction
│   ├── preprocessing.py            # Review text cleaning + k-core filter
│   ├── path.py                     # Project path constants (auto-creates runtime folders)
│   └── utils.py                    # Metrics, parquet/yaml/seed helpers, gz loader
│
├── main.py                         # Entry point: data preparation → train → test
├── requirements.txt
├── README.md
└── .gitignore

Model Description

ATRS consists of two sequential modules. The full architecture is illustrated below.

ATRS Architecture

1. Aspect Term Extraction Module

A pretrained Transformer encoder (PyABSA's English ATE checkpoint, FAST-LCF-ATEPC over DeBERTa-v3-base) reads each cleaned review and emits BIO-tagged aspect terms. Per-row aspect lists are then aggregated into per-user and per-item aspect sets, which become the inputs to the RS module.

Implementation: src/aspect_extraction.py, invoked from src/data_processing.py.

2. Recommender System Module

Each user and item aspect set is tokenized over a Word2Vec-trained vocabulary, encoded by a 1D-CNN (AspectEncoder), and concatenated with a learned ID embedding. The fused vector is projected and passed through a multi-head self-attention + FFN block (SelfAttentionBlock, Eqs 5–10) to yield aspect-aware user (F_u) and item (F_v) representations. Their concatenation is fed to an MLP regressor that outputs the predicted rating (Eqs 11–12).

Implementation: AspectEncoder, SelfAttentionBlock, ATRS.regressor in model/atrs.py.
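
To make the attention step concrete, here is a toy, dependency-free sketch of scaled dot-product self-attention with identity Q/K/V projections. It is only an illustration of the mechanism; the repo's SelfAttentionBlock is multi-head, uses learned projections, and adds an FFN (Eqs 5-10).

```python
import math

def softmax(xs):
    m = max(xs)                       # subtract max for numerical stability
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matmul(a, b):
    # a: (n, k), b: (k, m), as nested lists
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x):
    """Single-head scaled dot-product self-attention over a
    sequence x of d-dimensional vectors (Q = K = V = x)."""
    d = len(x[0])
    scores = matmul(x, transpose(x))                            # (n, n)
    weights = [softmax([s / math.sqrt(d) for s in row])
               for row in scores]                               # row-wise softmax
    return matmul(weights, x)                                   # weighted mix of inputs

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(x)
```

Each output row is a convex combination of the input rows, weighted by similarity, which is why attention lets every aspect representation borrow context from the others.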

How to Run

Configuration

All hyperparameters live in src/config.yaml — it is the single source of truth. Defaults reproduce the paper experiments.

The torch==2.3.1+cu121 / torchvision==0.18.1+cu121 wheels in requirements.txt target an RTX 3080 Ti (CUDA 12.1). A CUDA-capable GPU is required; main.py raises RuntimeError if no CUDA device is detected.

End-to-end run from a fresh checkout:

conda create -n atrs python=3.11
conda activate atrs
pip install -r requirements.txt
python main.py

Data Preparation

Place the dataset as data/raw/{fname}.{raw_ext} where {fname} and {raw_ext} match data.fname / data.raw_ext in config.yaml.

Required columns in raw JSONL: user_id, parent_asin, text, rating (an aspect column with pre-extracted terms is optional — if present, the ATE stage is skipped)
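
A quick pre-flight check for the raw file can be written as below. This is a hypothetical helper (not part of the repo), using the column names listed above.

```python
import json

REQUIRED = {"user_id", "parent_asin", "text", "rating"}

def check_jsonl_columns(lines):
    """Verify every JSONL record has the required columns.
    Returns True if the optional 'aspect' column is present on all
    records (in which case the ATE stage can be skipped)."""
    has_aspect = True
    for i, line in enumerate(lines, 1):
        rec = json.loads(line)
        missing = REQUIRED - rec.keys()
        if missing:
            raise ValueError(f"line {i}: missing columns {sorted(missing)}")
        has_aspect = has_aspect and ("aspect" in rec)
    return has_aspect
```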

The pipeline writes two cached artifacts under data/processed/ plus the final model checkpoint. On re-run, any artifact already on disk is reused as-is — to invalidate, delete the file. The train/test split, Word2Vec embeddings, and sequence padding are rebuilt in memory on every run.

{fname}_preprocessed.parquet — after text cleaning and k-core filter: raw columns + clean_text (HTML/URL-stripped, lowercased, contractions-expanded, stopwords-removed, lemmatized review body)

{fname}_aspects.parquet — after PyABSA aspect extraction and per-user/item aggregation: preprocessed columns + aspect (per-row term list), user_aspect_set (flattened concatenation per user), item_aspect_set (flattened concatenation per item)
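
A rough sketch of the kind of cleaning clean_text applies (HTML/URL stripping, whitespace normalization, lowercasing). The repo's src/preprocessing.py additionally expands contractions, removes stopwords, and lemmatizes.

```python
import re

def clean_text(text):
    """Strip HTML tags and URLs, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"\s+", " ", text).strip()    # normalize whitespace
    return text.lower()

clean_text("Great <b>tone</b>! See https://example.com NOW")
```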

Re-runs and caching

On every call to python main.py, the pipeline auto-skips any cache layer already on disk (aspects → preprocessed → raw). The train/test split, Word2Vec, and sequence padding always run fresh in memory — so changes to test_size, random_state, val_ratio, aspect_length_percentile, or w2v_* take effect immediately on the next run. Only k_core requires manually deleting {fname}_preprocessed.parquet to re-trigger the upstream filter.
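
The skip logic amounts to a chain of cache-existence checks from the most-derived layer down. A minimal sketch, with hypothetical stage names (the real pipeline is in src/data_processing.py):

```python
import os

def resolve_stage(processed_dir, fname):
    """Decide which pipeline stage to resume from, checking caches
    in order: aspects -> preprocessed -> raw."""
    if os.path.exists(os.path.join(processed_dir, f"{fname}_aspects.parquet")):
        return "train"       # both caches usable, go straight to training
    if os.path.exists(os.path.join(processed_dir, f"{fname}_preprocessed.parquet")):
        return "extract"     # rerun aspect extraction only
    return "preprocess"      # start from the raw file
```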

PyABSA's ./checkpoints.json and ./checkpoints/ directory are hardcoded CWD-relative inside the library; ATRS routes them under data/ate_output/.pyabsa/ via a chdir context so they don't pollute the project root.

Experimental Results

ATRS was evaluated on three real-world review datasets: Musical Instruments, Video Games, and Yelp (Pennsylvania). The results demonstrate that ATRS consistently outperforms representative baselines across all evaluation metrics, achieving average improvements of 19.54% in MAE and 11.89% in RMSE.

Musical Instruments

| Model | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|
| PMF | 1.306 | 2.640 | 1.625 | 35.034 |
| NCF | 1.174 | 1.705 | 1.306 | 35.401 |
| DeepCoNN | 0.786 | 1.137 | 1.067 | 29.931 |
| NARRE | 0.767 | 0.993 | 0.997 | 29.459 |
| AENAR | 0.665 | 0.970 | 0.985 | 27.193 |
| SAFMR | 0.705 | 0.975 | 0.987 | 28.388 |
| MFNR | 0.708 | 0.965 | 0.982 | 26.922 |
| ATRS (Proposed) | 0.640 | 0.933 | 0.966 | 26.638 |

Video Games

| Model | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|
| PMF | 1.220 | 2.407 | 1.551 | 33.948 |
| NCF | 0.948 | 1.331 | 1.154 | 35.032 |
| DeepCoNN | 0.847 | 1.263 | 1.124 | 32.850 |
| NARRE | 0.776 | 1.173 | 1.083 | 30.518 |
| AENAR | 0.693 | 1.002 | 1.001 | 28.039 |
| SAFMR | 0.711 | 1.033 | 1.016 | 30.016 |
| MFNR | 0.730 | 0.980 | 0.990 | 27.863 |
| ATRS (Proposed) | 0.646 | 0.970 | 0.985 | 27.537 |

Yelp

| Model | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|
| PMF | 1.276 | 2.803 | 1.674 | 38.330 |
| NCF | 1.085 | 1.674 | 1.294 | 39.320 |
| DeepCoNN | 0.937 | 1.381 | 1.175 | 38.276 |
| NARRE | 0.886 | 1.212 | 1.101 | 36.724 |
| AENAR | 0.845 | 1.177 | 1.085 | 35.605 |
| SAFMR | 0.881 | 1.229 | 1.109 | 36.076 |
| MFNR | 0.855 | 1.174 | 1.084 | 33.923 |
| ATRS (Proposed) | 0.832 | 1.163 | 1.078 | 34.917 |
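
The four metrics reported above follow the standard definitions (with MAPE expressed in percent). A plain restatement:

```python
import math

def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and MAPE (%) for rating prediction."""
    n = len(y_true)
    errs = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errs) / n
    mse = sum(e * e for e in errs) / n
    rmse = math.sqrt(mse)
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errs, y_true)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape}

m = regression_metrics([4.0, 2.0], [3.5, 2.5])
```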

Citation

If you use this repository in your research, please cite:

@article{LIM2026123078,
  title = {Reducing contextual noise in review-based recommendation via aspect term extraction and attention modeling},
  author = {Heena Lim and Xinzhe Li and Seonu Park and Qinglong Li and Jaekyeong Kim},
  journal = {Information Sciences},
  volume = {735},
  pages = {123078},
  year = {2026},
  doi = {10.1016/j.ins.2026.123078}
}

Contact

For research inquiries or collaborations, please contact:

Seonu Park, Ph.D. Student, Department of Big Data Analytics, Kyung Hee University. Email: sunu0087@khu.ac.kr

Qinglong Li, Assistant Professor, Division of Computer Engineering, Hansung University. Email: leecy@hansung.ac.kr

Last updated: April 2026