KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs

https://arxiv.org/abs/2504.07087

This repository contains a benchmarking system for testing LLM performance on knowledge graph tasks, with support for AWS Bedrock batch processing.

Requirements

  • Python 3.8 or higher
  • See requirements.txt for the list of dependencies.

Citation

@article{Markowitz2025KGLLMBenchAS,
  title={KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs},
  author={Elan Markowitz and Krupa Galiya and Greg Ver Steeg and A. G. Galstyan},
  journal={},
  year={2025},
  volume={abs/2504.07087},
  url={https://api.semanticscholar.org/CorpusID:277634268}
}

Updates

KG-LLM-Bench won Best Resource Paper at the 4th Workshop on Knowledge-Augmented NLP at NAACL 2025

Setup

Follow these steps to set up the project:

  1. Clone the Repository
    Clone this repository to your local machine:

    git clone https://github.com/Elanmarkowitz/kg-llm-bench.git
    cd kg-llm-bench
  2. Create a Virtual Environment (Optional)
    Create and activate a virtual environment:

    python3 -m venv venv
    source venv/bin/activate  # On macOS/Linux
    venv\Scripts\activate     # On Windows
  3. Install Dependencies
    Install the required Python packages:

    pip install -r requirements.txt
  4. Set Up Environment Variables
    This project uses python-dotenv to manage environment variables. Create a .env file in the root directory and add your AWS credentials along with your OpenAI and Gemini API keys:

    AWS_ACCESS_KEY_ID=<your-access-key-id>
    AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
    AWS_REGION=<your-region>
    OPENAI_API_KEY=<your-openai-api-key>
    GEMINI_API_KEY=<your-gemini-api-key>
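
These variables are read at runtime. A minimal sketch of how loading might look with python-dotenv and boto3 (the repo's actual loading code may differ):

import os

import boto3
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment
bedrock = boto3.client("bedrock-runtime", region_name=os.environ["AWS_REGION"])
openai_key = os.environ["OPENAI_API_KEY"]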

Instructions to download data from DVC (Data Version Control)

Make sure to install DVC before running dvc pull. You can do this with the following commands:

pip install dvc
pip install 'dvc[gdrive]'

dvc pull  # pull all data from the DVC remote

Instructions to run experiments with models

To run experiments with the models, use the run_experiments.py script. This script executes tasks defined in the configuration file and evaluates the performance of various models on knowledge graph tasks.

python run_experiments.py --config configs/run_small_datasets.yaml

Optional Arguments

  • --reevaluate: Reevaluate existing results.
  • --reevaluate_only: Only reevaluate existing results without running new experiments.
  • --batch: Enable batch mode for AWS Bedrock models.

For example, to reevaluate results in batch mode:

python run_experiments.py --config configs/run_small_datasets.yaml --reevaluate --batch

Ensure that the configuration file (configs/run_small_datasets.yaml) is properly set up with the desired task, pseudonymizer, and conversion configurations.
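
For a quick sanity check of a config before launching a run, something like the following prints its top-level sections (the exact schema is defined by the configs in this repo and is not assumed here):

import yaml

with open("configs/run_small_datasets.yaml") as f:
    config = yaml.safe_load(f)

for key, value in config.items():  # e.g. task, pseudonymizer, conversion sections
    print(key, "->", type(value).__name__)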

Running Experiments with Batch Processing

1. Run experiments in batch mode:

python run_experiments.py --config configs/run_small_datasets.yaml --batch

This will:

  • Create batch records in batch_data/pending/
  • Store placeholder results with PENDING: status (see the sketch after this list)
  • Continue processing all tasks
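
Because no model is called immediately in batch mode, result files hold placeholders until collection. A hypothetical helper for counting responses still waiting on a batch job (the PENDING: prefix comes from the description above; the result-file layout is an assumption):

from pathlib import Path

def count_pending(results_dir="benchmark_data"):
    """Count occurrences of the PENDING: placeholder across JSON result files."""
    return sum(p.read_text().count("PENDING:") for p in Path(results_dir).rglob("*.json"))

print(count_pending())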

2. Submit batch jobs to Bedrock:

python scripts/process_batches.py \
  --input-bucket $BEDROCK_INPUT_BUCKET \
  --output-bucket $BEDROCK_OUTPUT_BUCKET

3. Collect and process results:

python scripts/collect_results.py

Run this periodically to:

  • Check job status (status-API sketch below)
  • Download completed results
  • Update task result files
  • Move completed batches to archive
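
Behind the status check, Bedrock exposes a job-status API; with boto3 the call looks roughly like this (the ARN is a placeholder, and this is a sketch of the relevant AWS API rather than this repo's exact code):

import boto3

bedrock = boto3.client("bedrock")  # control-plane client, not bedrock-runtime

# jobIdentifier is the ARN returned when the batch job was created
job = bedrock.get_model_invocation_job(jobIdentifier="<your-job-arn>")
print(job["status"])  # e.g. Submitted, InProgress, Completed, Failed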

Instructions to generate new data from the knowledge graph

To generate new data from the knowledge graph, follow these steps:

1: Construct Base Datasets

Use the construct_base_datasets.py script to create base datasets for knowledge graph tasks. This script loads the knowledge graph and generates datasets based on the task configurations.

Run the script with the following command:

python construct_base_datasets.py --config configs/construct_base_datasets_small.yaml

Ensure that the configuration file (configs/construct_base_datasets_small.yaml) is properly set up with the desired task configurations.

2: Construct Formatted Datasets

After constructing the base datasets, use the construct_formatted_datasets.py script to format the datasets for specific tasks. This script processes the base datasets and applies formatting based on the conversion and pseudonymizer configurations.

Run the script with the following command:

python construct_formatted_datasets.py --config configs/construct_formatted_datasets_small.yaml

Ensure that the configuration file (configs/construct_formatted_datasets_small.yaml) is properly set up with the desired conversion and pseudonymizer configurations.
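
The pseudonymizer exists so that models must reason over the provided graph rather than answer from memorized facts about real entities: it swaps entity names for fabricated ones, consistently across all triples. A minimal sketch of the idea (the function and fake names are illustrative, not the repo's actual pseudonymizer interface):

import random

FAKE_NAMES = ["Zorvath", "Quellin", "Marnok", "Velindra", "Tessarion"]

def pseudonymize(triples, seed=0):
    """Replace each entity with a stable fake name across all triples."""
    rng = random.Random(seed)
    pool = FAKE_NAMES.copy()
    rng.shuffle(pool)
    mapping = {}

    def rename(entity):
        if entity not in mapping:
            mapping[entity] = pool.pop() if pool else f"Entity{len(mapping)}"
        return mapping[entity]

    return [(rename(h), r, rename(t)) for h, r, t in triples]

print(pseudonymize([("Paris", "capital of", "France")]))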

Directory Structure

.
├── batch_data/
│   ├── pending/      # Batches waiting to be submitted
│   ├── submitted/    # Batches currently processing
│   └── completed/    # Processed batches with results
├── benchmark_data/   # Task data and results
├── configs/          # Configuration files
├── llm/              # LLM provider implementations
├── scripts/          # Utility scripts
└── tasks/            # Task implementations

Batch Processing Flow

  1. Accumulation Phase

    • BatchBedrock provider accumulates requests
    • Creates JSONL files in the required format (format sketch after this list)
    • Stores metadata for tracking
  2. Submission Phase

    • Uploads records to S3
    • Creates Bedrock batch jobs
    • Tracks job ARNs and status
  3. Collection Phase

    • Monitors job completion
    • Downloads and processes results
    • Updates original task results
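
For reference, Bedrock batch inference consumes JSONL files in which each line pairs a recordId with a model-specific modelInput body. A sketch of writing one such record (the Claude-style message body is one example format; the repo's actual record layout may differ):

import json
import os

os.makedirs("batch_data/pending", exist_ok=True)

record = {
    "recordId": "CALL0000001",
    "modelInput": {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "Which entities border France?"}],
    },
}

with open("batch_data/pending/example.jsonl", "w") as f:
    f.write(json.dumps(record) + "\n")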

Configuration

The batch processing behavior can be configured through environment variables or command-line arguments (a reading sketch follows the list):

  • BATCH_SIZE: Maximum records per batch (default: 100)
  • BATCH_TIMEOUT: Minutes to wait before starting a new batch (default: 60)
  • AWS_DEFAULT_REGION: AWS region for Bedrock
  • BEDROCK_INPUT_BUCKET: S3 bucket for input data
  • BEDROCK_OUTPUT_BUCKET: S3 bucket for output data
  • BEDROCK_BATCH_ROLE_ARN: IAM role ARN for batch jobs
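
A sketch of how these settings might be resolved in code, using the defaults listed above (the repo's actual handling may differ):

import os

batch_size = int(os.environ.get("BATCH_SIZE", "100"))               # max records per batch
batch_timeout_minutes = int(os.environ.get("BATCH_TIMEOUT", "60"))  # minutes before a new batch
input_bucket = os.environ["BEDROCK_INPUT_BUCKET"]                   # required, no default
output_bucket = os.environ["BEDROCK_OUTPUT_BUCKET"]                 # required, no default
role_arn = os.environ["BEDROCK_BATCH_ROLE_ARN"]                     # required, no default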

Error Handling

The system includes robust error handling:

  • Failed jobs are marked and preserved
  • Partial results are processed when available
  • Automatic retries for transient failures (backoff sketch below)
  • Detailed logging for debugging
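
The retry behavior typically amounts to exponential backoff around the API call; a generic sketch (not the repo's exact implementation):

import time

def with_retries(fn, max_attempts=3, base_delay=1.0):
    """Retry fn with exponential backoff, re-raising after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)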

Monitoring

Monitor batch processing status:

  • Check batch_data/*/metadata.json files (scan sketch below)
  • View AWS Bedrock console
  • Monitor S3 buckets
  • Check task result files
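
For the first of these, a quick scan over the metadata files might look like the following (nothing about the metadata schema is assumed; it prints whatever the repo writes):

import json
from pathlib import Path

for path in Path("batch_data").rglob("metadata.json"):
    meta = json.loads(path.read_text())
    print(path.parent, "->", meta)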

Best Practices

  1. Use appropriate batch sizes (100-1000 records)
  2. Monitor costs and completion times
  3. Regularly collect results
  4. Maintain S3 bucket lifecycle policies
  5. Review and archive completed batches

Troubleshooting

Common issues and solutions:

  1. Missing Results: Check job status and S3 paths
  2. Failed Jobs: Review CloudWatch logs
  3. Stuck Jobs: Check IAM permissions
  4. S3 Errors: Verify bucket permissions

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Submit a pull request

License

MIT License - See LICENSE file for details
