📄 Paper: SUMMIR: Sentence Unified Multimetric Model for Importance Ranking (Preprint)
This project provides a complete framework for ranking sports-related sentences based on their importance and relevance. The system leverages large language models (LLMs) to generate insights from sports articles and then uses custom-trained reward models to rank these insights. The project is structured to handle everything from data preparation to model training and evaluation.
Additionally, we introduce SUMMIR (Sentence Unified Multimetric Model for Importance Ranking), a novel architecture designed to rank these insights according to user-specific interests. These rankings are further tested and verified against gold-standard human rankings.
The goal of this project is to automatically rank sentences from sports articles to highlight the most important information. This is achieved through a multi-stage process:
- Data Collection and Validation: Sports articles are collected and validated to ensure they are relevant to specific games.
- Insight Generation: An LLM is used to generate key insights from the validated articles.
- Factual Scoring: The generated insights are scored for factual consistency against the original articles.
- Model Training: Reward models (Llama 3.2 1B and 3B) are trained using Proximal Policy Optimization (PPO) to rank the insights. The models are optimized for either NDCG or Recall.
- Evaluation: The performance of the ranking models is evaluated using NDCG and Recall metrics.
```mermaid
graph TD
subgraph Root ["📁 Sports Insight Ranking"]
ROOT["SUMMIR"]
ROOT --> DS["📊 Dataset/"]
ROOT --> DG["⚙️ data_generation/"]
ROOT --> TC["🧠 training_code/"]
ROOT --> EC["📈 evaluation_code/"]
ROOT --> SF["📋 supplementary_files/"]
ROOT --> DOCS["📄 Documentation"]
DS --> DS_JSONL["Dataset_jsonl/"]
DS --> DS_PROC["Dataset_proccessed/"]
DS_JSONL --> DS_SPORTS["📁 Sport Corpora"]
DS_PROC --> DS_PIPE["📁 6-Stage Pipeline"]
DG --> PROMPTS["prompt_for_each_sport/"]
DG --> DG_CODE["🐍 Codes: Data Generation"]
TC --> IMAGES["images_while_training/"]
TC --> TC_CODE["🐍 Codes: Model Training"]
EC --> EC_CODE["🐍 Codes: Evaluation"]
SF --> SF_CODE["🐍 Codes: Data Processing"]
SF --> SF_DATA["📋 Lexicon CSVs"]
DOCS --> PDF1["Supplementary_Appendix.pdf"]
DOCS --> PDF2["dataset_details.pdf"]
DOCS --> CSV["response_human.csv"]
end
click DS "Dataset/README.md"
click DG "data_generation/README.md"
click TC "training_code/README.md"
click EC "evaluation_code/README.md"
click SF "supplementary_files/README.md"
style ROOT fill:#1a202c,stroke:#4a5568,color:#fff
style DS fill:#48bb78,stroke:#276749,color:#fff
style DG fill:#ed8936,stroke:#c05621,color:#fff
style TC fill:#9f7aea,stroke:#6b46c1,color:#fff
style EC fill:#f56565,stroke:#c53030,color:#fff
style SF fill:#4299e1,stroke:#2b6cb0,color:#fff
style DOCS fill:#718096,stroke:#4a5568,color:#fff
style DG_CODE fill:#fed7aa,stroke:#ea580c,color:#000
style TC_CODE fill:#ddd6fe,stroke:#7c3aed,color:#000
style EC_CODE fill:#fecaca,stroke:#dc2626,color:#000
style SF_CODE fill:#bfdbfe,stroke:#2563eb,color:#000
style SF_DATA fill:#a5f3fc,stroke:#0891b2,color:#000
style DS_JSONL fill:#86efac,stroke:#16a34a,color:#000
style DS_PROC fill:#86efac,stroke:#16a34a,color:#000
style DS_SPORTS fill:#bbf7d0,stroke:#22c55e,color:#000
style DS_PIPE fill:#bbf7d0,stroke:#22c55e,color:#000
```
| Directory | Purpose | Documentation |
|---|---|---|
| `Dataset/` | Sport-specific training datasets | [README](Dataset/README.md) |
| `data_generation/` | Insight extraction pipeline | [README](data_generation/README.md) |
| `training_code/` | PPO reward model training | [README](training_code/README.md) |
| `evaluation_code/` | NDCG/Recall evaluation | [README](evaluation_code/README.md) |
| `supplementary_files/` | Feature computation data | [README](supplementary_files/README.md) |
The data generation process is handled by the scripts in the `data_generation/` directory.
- `article_validation_save.py`: Validates that the collected sports articles are relevant to the intended match. It uses a pre-trained language model to classify articles as "relevant" or "irrelevant".
- `insight_generation.py`: Once articles are validated, this script uses a powerful LLM to generate a structured set of insights from the text.
- `factScore.py`: To ensure the quality of the generated insights, this script uses GPT-4o to score the factual accuracy of each insight against the source article (a minimal sketch follows below).
- `summacConv.py`: As an alternative or complementary step to `factScore.py`, this script uses the SummaCConv model to evaluate the factual consistency of the insights.
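For illustration, here is a minimal sketch of a GPT-4o consistency check in the spirit of `factScore.py`; the prompt wording and the 0-to-1 scale are our assumptions, not the repository's exact code.

```python
# Hypothetical sketch of a GPT-4o factual-consistency check in the spirit of
# factScore.py; the prompt and 0-1 scale are assumptions, not the repo's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def fact_score(article: str, insight: str) -> float:
    """Ask GPT-4o how well an insight is supported by its source article."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Article:\n{article}\n\nInsight:\n{insight}\n\n"
                "On a scale from 0 to 1, how factually consistent is the insight "
                "with the article? Reply with the number only."
            ),
        }],
    )
    return float(response.choices[0].message.content.strip())
```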
The core of this project is the training of the ranking models, which is done using the scripts in the `training_code/` directory.
- Models: The project uses Llama 3.2 models with 1B and 3B parameters.
- Training Method: The models are fine-tuned using Proximal Policy Optimization (PPO), a reinforcement learning algorithm.
- Reward Signals: Three reward configurations are used for training:
  - NDCG (Normalized Discounted Cumulative Gain): The `*_ndcg_only.py` scripts train the model to optimize the ranking of insights based on the NDCG metric (sketched below).
  - Recall: The `*_recall_only.py` scripts train the model to maximize the recall of the top-ranked insights.
  - SUMMIR: The `6_metrics_Training_code.py` script trains the model according to our novel framework.
The training process uses the datasets in the `Dataset/` folder.
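To make the NDCG reward concrete, below is a minimal sketch of computing NDCG@k from a model's predicted ordering against gold relevance labels; the actual reward code lives in the `*_ndcg_only.py` scripts and may differ in detail.

```python
# Minimal NDCG@k reward sketch; the real reward code lives in *_ndcg_only.py.
import math

def dcg(relevances):
    """Discounted cumulative gain; position 0 is the top of the ranking."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_reward(predicted_order, gold_relevance, k=10):
    """NDCG@k of a predicted ordering of insight ids vs. gold relevance scores."""
    ranked = [gold_relevance[i] for i in predicted_order][:k]
    ideal = sorted(gold_relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0

# Example: the model ranks insight 2 first, but insight 0 is most relevant.
print(ndcg_reward([2, 0, 1], {0: 3.0, 1: 1.0, 2: 2.0}, k=3))  # ~0.92
```

During PPO, a scalar like this is all the trainer needs: a rollout whose ordering scores a higher NDCG receives a larger reward.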
The performance of the trained ranking models is assessed using the `evaluation_code/improvised_Evaluation_code.ipynb` notebook. This notebook:
- Loads a trained model.
- Generates rankings for a sample of data points.
- Calculates `NDCG@k` and `Recall@k` (for k = 2, 5, 10) to measure the quality of the rankings (see the sketch below).
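For reference, here is a minimal, self-contained sketch of the `Recall@k` computation, assuming the notebook compares a predicted ordering of insight ids against a gold-relevant set (the variable names are illustrative):

```python
# Illustrative Recall@k computation; variable names are not from the notebook.
def recall_at_k(predicted_order, relevant_ids, k):
    """Fraction of gold-relevant insights recovered in the model's top k."""
    return len(set(predicted_order[:k]) & set(relevant_ids)) / len(relevant_ids)

predicted = [7, 2, 9, 0, 4, 1, 8, 3, 6, 5]  # model's ranking of ten insights
gold_top = [2, 7, 3, 0, 9]                  # insights judged most important

for k in (2, 5, 10):
    print(f"Recall@{k} = {recall_at_k(predicted, gold_top, k):.2f}")
# Recall@2 = 0.40, Recall@5 = 0.80, Recall@10 = 1.00
```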
For further validation, we compare our model's rankings against human evaluations.
- `SUMMIR_example_response.md`: This file contains the sentences that were distributed for human evaluation.
- `response_human.csv`: This file contains the ranking responses from 30 different people, which serve as a gold standard for our evaluation.
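One plausible way to quantify agreement with this gold standard is a rank correlation per annotator, as sketched below with Kendall's tau; the assumed CSV layout (one ranking per row) may not match the actual file, so adapt the loading code accordingly.

```python
# Hypothetical agreement check against response_human.csv; assumes one human
# ranking per row, which may not match the actual file layout.
import csv
from scipy.stats import kendalltau

model_ranking = [3, 1, 4, 0, 2]  # model's ordering of five example sentences

with open("response_human.csv", newline="", encoding="utf-8") as f:
    human_rankings = [[int(x) for x in row] for row in csv.reader(f)]

taus = [kendalltau(model_ranking, h).statistic for h in human_rankings]
print(f"mean Kendall tau over {len(taus)} annotators: {sum(taus) / len(taus):.3f}")
```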
To run the project you will need:
- Python 3.x
- PyTorch and other required libraries (see the `import` statements in the scripts).
- Access to the pre-trained models specified in the scripts (e.g., Llama 3.2, Qwen). You will need to update the model paths in the scripts to point to your local model directories.
- An OpenAI API key for `factScore.py`.
To reproduce the pipeline:
1. Data Generation:
   - Run `data_generation/article_validation_save.py` to validate the raw articles.
   - Run `data_generation/insight_generation.py` to generate insights from the validated articles.
   - Run `data_generation/factScore.py` and/or `data_generation/summacConv.py` to score the insights.
2. Model Training:
   - Choose the model size (1B or 3B) and the reward metric (NDCG or Recall).
   - Execute the corresponding script from the `training_code/` directory. For example, to train the 3B model with NDCG: `python training_code/Llama-3.2-3B-ndcg_only.py`
3. Evaluation:
   - Open and run the `evaluation_code/improvised_Evaluation_code.ipynb` notebook.
   - Make sure to update the `LLAMA_MODEL_PATH` variable in the notebook to point to your trained model's directory.
The pipeline relies on the following models:
- Article Validation: `Qwen2.5-32B-Instruct`
- Insight Generation: `DeepSeek-R1-Distill-Llama-70B`
- Factual Scoring: `gpt-4o`, `SummaCConv`
Two reward-model sizes are trained for ranking:
- Llama 3.2 1B: A smaller, more efficient model for ranking.
- Llama 3.2 3B: A larger, more powerful model for higher-accuracy ranking.
The `supplementary_files/` directory contains CSV files used for our novel approach:
- `processed_persons.csv`: A list of names of individuals (e.g., players, coaches) that can be used to identify key people in the text.
- `sports_keywords.csv`: A collection of sports-related keywords.
- `sports_sentiment.csv`: A list of words with associated sentiment scores, which can be used for sentiment analysis.
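As one illustration of how these lexicons could feed per-sentence features for SUMMIR, the sketch below derives simple counts; the CSV column layouts are assumptions, so check the actual files first.

```python
# Hypothetical sketch of turning the lexicon CSVs into per-sentence features;
# the column layouts assumed below may not match the actual files.
import csv

def load_terms(path):
    """Load a one-term-per-row CSV into a lowercase set."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

def load_sentiment(path):
    """Load word,score rows into a dict (assumed two-column layout)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[0].strip().lower(): float(row[1])
                for row in csv.reader(f) if len(row) >= 2}

persons = load_terms("supplementary_files/processed_persons.csv")
keywords = load_terms("supplementary_files/sports_keywords.csv")
sentiment = load_sentiment("supplementary_files/sports_sentiment.csv")

def sentence_features(sentence: str) -> dict:
    """Count lexicon hits and sum sentiment scores over a sentence's tokens."""
    tokens = [t.strip(".,!?\"'").lower() for t in sentence.split()]
    return {
        "person_mentions": sum(t in persons for t in tokens),
        "keyword_hits": sum(t in keywords for t in tokens),
        "sentiment": sum(sentiment.get(t, 0.0) for t in tokens),
    }

print(sentence_features("Messi scored a stunning free kick in stoppage time."))
```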