Official implementation of:
Lim, H., Li, X., Park, S., Li, Q., & Kim, J. (2026). Reducing contextual noise in review-based recommendation via aspect term extraction and attention modeling. *Information Sciences*, 735, 123078.
This repository is the official implementation of ATRS (Aspect Term-aware Recommender System), published in Information Sciences (2026).
Most review-based recommendation models process entire review bodies indiscriminately, allowing aspect-relevant signal to be diluted by surrounding context. ATRS addresses this by routing review text through a dedicated Aspect Term Extraction (ATE) stage that filters out non-aspect content before downstream encoding.
The retained aspect terms are encoded with a 1D-CNN over Word2Vec embeddings, fused with user/item ID embeddings, and passed through a self-attention block to form aspect-aware user and item representations. These are concatenated and forwarded to an MLP that predicts a continuous rating score as a regression target. Quantitative comparisons against representative recommendation baselines on Amazon and Yelp datasets are reported in Experimental Results.
```
├── data/
│   ├── raw/                    # Source datasets — place {fname}.{raw_ext} here
│   ├── processed/              # Pipeline parquet caches (preprocessed / aspects)
│   ├── ate_output/             # PyABSA workspace + extraction JSON
│   │   └── .pyabsa/            # Contained PyABSA CWD: checkpoints/, checkpoints.json, result JSON
│   └── ATRS Architecture.png
│
├── model/
│   ├── atrs.py                 # ATRS architecture, trainer, and predictor
│   └── save/                   # Best checkpoint per dataset (best.pth)
│
├── src/
│   ├── config.yaml             # Single source of truth for all hyperparameters
│   ├── data_processing.py      # DataProcessor pipeline + RecommenderDataset + DataLoader factory
│   ├── aspect_extraction.py    # ATExtractor — PyABSA wrapper for aspect term extraction
│   ├── preprocessing.py        # Review text cleaning + k-core filter
│   ├── path.py                 # Project path constants (auto-creates runtime folders)
│   └── utils.py                # Metrics, parquet/yaml/seed helpers, gz loader
│
├── main.py                     # Entry point: data preparation → train → test
├── requirements.txt
├── README.md
└── .gitignore
```

ATRS consists of two sequential modules. The full architecture is illustrated below.
A pretrained Transformer encoder (PyABSA's English ATE checkpoint, FAST-LCF-ATEPC over DeBERTa-v3-base) reads each cleaned review and emits BIO-tagged aspect terms. Per-row aspect lists are then aggregated into per-user and per-item aspect sets, which become the inputs to the RS module.
Implementation: `src/aspect_extraction.py`, invoked from `src/data_processing.py`.
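The per-user/per-item aggregation step can be sketched in plain Python. The column names `user_id` and `parent_asin` follow the required raw columns listed below; this grouping helper is a simplified stand-in for the actual logic in `src/data_processing.py`, not a copy of it:

```python
from collections import defaultdict

def aggregate_aspect_sets(rows):
    """Flatten per-row aspect term lists into per-user and per-item
    aspect sequences (duplicates kept, i.e. a flattened concatenation)."""
    user_aspects = defaultdict(list)
    item_aspects = defaultdict(list)
    for row in rows:
        user_aspects[row["user_id"]].extend(row["aspect"])
        item_aspects[row["parent_asin"]].extend(row["aspect"])
    return dict(user_aspects), dict(item_aspects)

# Toy example with hypothetical IDs and aspect terms:
rows = [
    {"user_id": "u1", "parent_asin": "i1", "aspect": ["strings", "tone"]},
    {"user_id": "u1", "parent_asin": "i2", "aspect": ["price"]},
    {"user_id": "u2", "parent_asin": "i1", "aspect": ["tone"]},
]
users, items = aggregate_aspect_sets(rows)
# users["u1"] -> ["strings", "tone", "price"]
# items["i1"] -> ["strings", "tone", "tone"]
```

These per-user and per-item sequences are what the RS module tokenizes and encodes.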
Each user and item aspect set is tokenized over a Word2Vec-trained vocabulary, encoded by a 1D-CNN (AspectEncoder), and concatenated with a learned ID embedding. The fused vector is projected and passed through a multi-head self-attention + FFN block (SelfAttentionBlock, Eqs 5–10) to yield aspect-aware user (F_u) and item (F_v) representations. Their concatenation is fed to an MLP regressor that outputs the predicted rating (Eqs 11–12).
Implementation: `AspectEncoder`, `SelfAttentionBlock`, and `ATRS.regressor` in `model/atrs.py`.
All hyperparameters live in `src/config.yaml` — it is the single source of truth. Defaults reproduce the paper experiments.
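For orientation, a purely illustrative fragment of the kinds of keys this README refers to (the actual layout of `config.yaml` may differ; only the key names `data.fname`, `data.raw_ext`, `k_core`, `test_size`, `random_state`, `val_ratio`, `aspect_length_percentile`, and the `w2v_*` group are taken from this document — the grouping and values below are hypothetical):

```yaml
data:
  fname: musical_instruments     # hypothetical value
  raw_ext: jsonl                 # hypothetical value
preprocessing:
  k_core: 5                      # hypothetical value
split:
  test_size: 0.2                 # hypothetical value
  val_ratio: 0.1                 # hypothetical value
  random_state: 42               # hypothetical value
```

Consult the shipped `src/config.yaml` for the authoritative keys and paper defaults.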
The `torch==2.3.1+cu121` / `torchvision==0.18.1+cu121` wheels in `requirements.txt` are built for CUDA 12.1 (development and testing used an RTX 3080 Ti). A CUDA-capable GPU is required — `main.py` raises `RuntimeError` if no CUDA device is detected.
End-to-end run from a fresh checkout:
```bash
conda create -n atrs python=3.11
conda activate atrs
pip install -r requirements.txt
python main.py
```

Place the dataset as `data/raw/{fname}.{raw_ext}`, where `{fname}` and `{raw_ext}` match `data.fname` / `data.raw_ext` in `config.yaml`.
Required columns in raw JSONL:
`user_id`, `parent_asin`, `text`, `rating`

(an `aspect` column with pre-extracted terms is optional — if present, the ATE stage is skipped)
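A quick sanity check on a raw file before running, assuming standard JSON Lines (one JSON object per line). This helper is not part of the repo, just a convenience sketch:

```python
import json

REQUIRED = {"user_id", "parent_asin", "text", "rating"}

def check_jsonl_columns(path, n_lines=100):
    """Return the set of required columns missing from any of the
    first n_lines records of a JSON Lines file (empty set = OK)."""
    missing = set()
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= n_lines:
                break
            record = json.loads(line)
            missing |= REQUIRED - record.keys()
    return missing
```

Running it against your `data/raw/` file catches a wrong schema before the pipeline spends time on preprocessing.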
The pipeline writes two cached artifacts under data/processed/ plus the final model checkpoint. On re-run, any artifact already on disk is reused as-is — to invalidate, delete the file. The train/test split, Word2Vec embeddings, and sequence padding are rebuilt in memory on every run.
- `{fname}_preprocessed.parquet` — after text cleaning and k-core filtering: raw columns plus `clean_text` (the review body with HTML/URLs stripped, lowercased, contractions expanded, stopwords removed, and lemmatized)
- `{fname}_aspects.parquet` — after PyABSA aspect extraction and per-user/item aggregation: preprocessed columns plus `aspect` (per-row term list), `user_aspect_set` (flattened concatenation per user), and `item_aspect_set` (flattened concatenation per item)
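A dependency-free sketch of what the two preprocessing steps do. The real `src/preprocessing.py` additionally expands contractions, removes stopwords, and lemmatizes; this is a minimal stand-in, not the repo's implementation:

```python
import re
from collections import Counter

def clean_text(text):
    """Minimal review cleaning: strip HTML tags and URLs, lowercase,
    keep letters only, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)            # HTML tags
    text = re.sub(r"https?://\S+", " ", text)       # URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # letters only
    return re.sub(r"\s+", " ", text).strip()

def k_core_filter(interactions, k):
    """Iteratively drop (user, item) interactions until every remaining
    user and item appears in at least k interactions."""
    pairs = list(interactions)
    while True:
        users = Counter(u for u, _ in pairs)
        items = Counter(i for _, i in pairs)
        kept = [(u, i) for u, i in pairs if users[u] >= k and items[i] >= k]
        if len(kept) == len(pairs):
            return kept
        pairs = kept
```

The iteration in `k_core_filter` matters: removing a sparse user can push an item below the threshold, so the filter repeats until a fixed point.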
On every call to python main.py, the pipeline auto-skips any cache layer already on disk (aspects → preprocessed → raw). The train/test split, Word2Vec, and sequence padding always run fresh in memory — so changes to test_size, random_state, val_ratio, aspect_length_percentile, or w2v_* take effect immediately on the next run. Only k_core requires manually deleting {fname}_preprocessed.parquet to re-trigger the upstream filter.
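The skip-if-exists behaviour boils down to a load-or-compute pattern like the following (JSON stands in for parquet so the sketch stays dependency-free; the actual pipeline lives in `src/data_processing.py`):

```python
import json
from pathlib import Path

def cached(path, compute):
    """Load `path` if it exists; otherwise run `compute`, persist the
    result, and return it. Deleting the file invalidates the layer."""
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result))
    return result
```

Chaining such calls — aspects computed from preprocessed, preprocessed from raw — yields exactly the aspects → preprocessed → raw skip order described above.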
PyABSA's ./checkpoints.json and ./checkpoints/ directory are hardcoded CWD-relative inside the library; ATRS routes them under data/ate_output/.pyabsa/ via a chdir context so they don't pollute the project root.
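That containment amounts to a small chdir context manager (Python 3.11 also ships `contextlib.chdir`; a manual version is shown here for clarity, and is a sketch of the idea rather than the repo's exact code):

```python
import os
from contextlib import contextmanager

@contextmanager
def working_dir(path):
    """Temporarily chdir into `path` so libraries that write
    CWD-relative files (e.g. PyABSA's checkpoints/) stay contained."""
    os.makedirs(path, exist_ok=True)
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)

# Hypothetical usage mirroring the layout above:
# with working_dir("data/ate_output/.pyabsa"):
#     run_aspect_extraction()
```

The `finally` clause guarantees the working directory is restored even if extraction raises.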
ATRS was evaluated on three real-world review datasets: Musical Instruments, Video Games, and Yelp (Pennsylvania). The results demonstrate that ATRS consistently outperforms representative baselines across all evaluation metrics, achieving average improvements of 19.54% in MAE and 11.89% in RMSE.
**Musical Instruments**

| Model | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|
| PMF | 1.306 | 2.640 | 1.625 | 35.034 |
| NCF | 1.174 | 1.705 | 1.306 | 35.401 |
| DeepCoNN | 0.786 | 1.137 | 1.067 | 29.931 |
| NARRE | 0.767 | 0.993 | 0.997 | 29.459 |
| AENAR | 0.665 | 0.970 | 0.985 | 27.193 |
| SAFMR | 0.705 | 0.975 | 0.987 | 28.388 |
| MFNR | 0.708 | 0.965 | 0.982 | 26.922 |
| ATRS (Proposed) | 0.640 | 0.933 | 0.966 | 26.638 |

**Video Games**

| Model | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|
| PMF | 1.220 | 2.407 | 1.551 | 33.948 |
| NCF | 0.948 | 1.331 | 1.154 | 35.032 |
| DeepCoNN | 0.847 | 1.263 | 1.124 | 32.850 |
| NARRE | 0.776 | 1.173 | 1.083 | 30.518 |
| AENAR | 0.693 | 1.002 | 1.001 | 28.039 |
| SAFMR | 0.711 | 1.033 | 1.016 | 30.016 |
| MFNR | 0.730 | 0.980 | 0.990 | 27.863 |
| ATRS (Proposed) | 0.646 | 0.970 | 0.985 | 27.537 |

**Yelp**

| Model | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|
| PMF | 1.276 | 2.803 | 1.674 | 38.330 |
| NCF | 1.085 | 1.674 | 1.294 | 39.320 |
| DeepCoNN | 0.937 | 1.381 | 1.175 | 38.276 |
| NARRE | 0.886 | 1.212 | 1.101 | 36.724 |
| AENAR | 0.845 | 1.177 | 1.085 | 35.605 |
| SAFMR | 0.881 | 1.229 | 1.109 | 36.076 |
| MFNR | 0.855 | 1.174 | 1.084 | 33.923 |
| ATRS (Proposed) | 0.832 | 1.163 | 1.078 | 34.917 |
If you use this repository in your research, please cite:
```bibtex
@article{LIM2026123078,
  title   = {Reducing contextual noise in review-based recommendation via aspect term extraction and attention modeling},
  author  = {Heena Lim and Xinzhe Li and Seonu Park and Qinglong Li and Jaekyeong Kim},
  journal = {Information Sciences},
  volume  = {735},
  pages   = {123078},
  year    = {2026},
  doi     = {10.1016/j.ins.2026.123078}
}
```

For research inquiries or collaborations, please contact:
- Seonu Park, Ph.D. Student, Department of Big Data Analytics, Kyung Hee University. Email: sunu0087@khu.ac.kr
- Qinglong Li, Assistant Professor, Division of Computer Engineering, Hansung University. Email: leecy@hansung.ac.kr
Last updated: April 2026
