Paper: https://arxiv.org/abs/2504.07087
This repository contains a benchmarking system for testing LLM performance on knowledge graph tasks, with support for AWS Bedrock batch processing.
If you use KG-LLM-Bench, please cite:

```bibtex
@article{Markowitz2025KGLLMBenchAS,
  title   = {KG-LLM-Bench: A Scalable Benchmark for Evaluating LLM Reasoning on Textualized Knowledge Graphs},
  author  = {Elan Markowitz and Krupa Galiya and Greg Ver Steeg and A. G. Galstyan},
  journal = {},
  year    = {2025},
  volume  = {abs/2504.07087},
  url     = {https://api.semanticscholar.org/CorpusID:277634268}
}
```
KG-LLM-Bench won Best Resource Paper at the 4th Workshop on Knowledge-Augmented NLP at NAACL 2025.
- Python 3.8 or higher
- See `requirements.txt` for the list of dependencies.
Follow these steps to set up the project:
- **Clone the Repository**

  Clone this repository to your local machine:

  ```bash
  git clone https://github.com/Elanmarkowitz/kg-llm-bench.git
  cd kg-llm-bench
  ```
- **Create a Virtual Environment (Optional)**

  Create and activate a virtual environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate   # On macOS/Linux
  venv\Scripts\activate      # On Windows
  ```
- **Install Dependencies**

  Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```
- **Set Up Environment Variables**

  This project uses `python-dotenv` to manage environment variables. Create a `.env` file in the root directory and add your AWS credentials, OpenAI API key, and Gemini API key:

  ```bash
  AWS_ACCESS_KEY_ID=<your-access-key-id>
  AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
  AWS_REGION=<your-region>
  OPENAI_API_KEY=<your-openai-api-key>
  GEMINI_API_KEY=<your-gemini-api-key>
  ```
Make sure to install DVC before running `dvc pull`. You can do this with the following commands:

```bash
pip install dvc
pip install 'dvc[gdrive]'
dvc pull  # pull all data from the dvc remote drive
```

To run experiments with the models, use the `run_experiments.py` script. This script executes tasks defined in the configuration file and evaluates the performance of various models on knowledge graph tasks.
```bash
python run_experiments.py --config configs/run_small_datasets.yaml
```

The script supports the following flags:

- `--reevaluate`: Reevaluate existing results.
- `--reevaluate_only`: Only reevaluate existing results without running new experiments.
- `--batch`: Enable batch mode for AWS Bedrock models.
For example, to reevaluate results in batch mode:
```bash
python run_experiments.py --config configs/run_small_datasets.yaml --reevaluate --batch
```

Ensure that the configuration file (`configs/run_small_datasets.yaml`) is properly set up with the desired task, pseudonymizer, and conversion configurations.
Running with the `--batch` flag:

```bash
python run_experiments.py --config configs/run_small_datasets.yaml --batch
```

This will:

- Create batch records in `batch_data/pending/` (see the sketch after this list)
- Store placeholder results with `PENDING:` status
- Continue processing all tasks
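A quick, illustrative way to see what is queued (this assumes batch records are JSONL files under `batch_data/pending/`, per the layout described below; file naming is not guaranteed):

```python
# Hypothetical sketch: list pending batch record files and their request counts.
from pathlib import Path

for record in sorted(Path("batch_data/pending").glob("*.jsonl")):
    with record.open() as f:
        n_requests = sum(1 for _ in f)  # one JSON object per line
    print(f"{record.name}: {n_requests} queued requests")
```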
Submit the pending batches to Bedrock:

```bash
python scripts/process_batches.py \
    --input-bucket $BEDROCK_INPUT_BUCKET \
    --output-bucket $BEDROCK_OUTPUT_BUCKET
```

Then collect finished results:

```bash
python scripts/collect_results.py
```

Run this periodically (a simple polling wrapper is sketched after this list) to:
- Check job status
- Download completed results
- Update task result files
- Move completed batches to archive
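If you don't want to re-run collection by hand, a trivial polling wrapper will do (the 30-minute interval is an arbitrary choice, not a project default):

```python
# Hypothetical polling loop: re-run the collection script at a fixed interval.
import subprocess
import time

while True:
    subprocess.run(["python", "scripts/collect_results.py"], check=False)
    time.sleep(30 * 60)  # 30 minutes; tune to your batch turnaround time
```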
To generate new data from the knowledge graph, follow these steps:
Use the `construct_base_datasets.py` script to create base datasets for knowledge graph tasks. This script loads the knowledge graph and generates datasets based on the task configurations.
Run the script with the following command:
```bash
python construct_base_datasets.py --config configs/construct_base_datasets_small.yaml
```

Ensure that the configuration file (`configs/construct_base_datasets_small.yaml`) is properly set up with the desired task configurations.
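To sanity-check a configuration before a long run, a small sketch (the top-level key names depend on the config file and are not assumed here):

```python
# Sketch: inspect the top-level sections of a task configuration file.
import yaml

with open("configs/construct_base_datasets_small.yaml") as f:
    cfg = yaml.safe_load(f)

print(list(cfg))  # top-level sections the construction script consumes
```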
After constructing the base datasets, use the `construct_formatted_datasets.py` script to format the datasets for specific tasks. This script processes the base datasets and applies formatting based on the conversion and pseudonymizer configurations.
Run the script with the following command:
```bash
python construct_formatted_datasets.py --config configs/construct_formatted_datasets_small.yaml
```

Ensure that the configuration file (`configs/construct_formatted_datasets_small.yaml`) is properly set up with the desired conversion and pseudonymizer configurations.
```
.
├── batch_data/
│   ├── pending/      # Batches waiting to be submitted
│   ├── submitted/    # Batches currently processing
│   └── completed/    # Processed batches with results
├── benchmark_data/   # Task data and results
├── configs/          # Configuration files
├── llm/              # LLM provider implementations
├── scripts/          # Utility scripts
└── tasks/            # Task implementations
```
- **Accumulation Phase**
  - BatchBedrock provider accumulates requests
  - Creates JSONL files with the proper format (see the sketch after this list)
  - Stores metadata for tracking
- **Submission Phase**
  - Uploads records to S3
  - Creates Bedrock batch jobs
  - Tracks job ARNs and status
- **Collection Phase**
  - Monitors job completion
  - Downloads and processes results
  - Updates original task results
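For reference, Amazon Bedrock batch inference jobs consume JSONL input where each line pairs a `recordId` with a `modelInput` payload; the payload schema depends on the target model. A sketch (the record ID, prompt, and file path are illustrative, and the `modelInput` shown uses the Anthropic messages format as one example):

```python
# Sketch: append one batch-inference record in Bedrock's recordId/modelInput shape.
import json

record = {
    "recordId": "task-0001",  # hypothetical ID; Bedrock echoes it in the output
    "modelInput": {           # model-specific payload; Anthropic messages shown
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": "List the neighbors of entity Q42."}],
    },
}

with open("batch_data/pending/example.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```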
The batch processing behavior can be configured through environment variables or command line arguments:
- `BATCH_SIZE`: Maximum records per batch (default: 100)
- `BATCH_TIMEOUT`: Minutes before starting a new batch (default: 60)
- `AWS_DEFAULT_REGION`: AWS region for Bedrock
- `BEDROCK_INPUT_BUCKET`: S3 bucket for input data
- `BEDROCK_OUTPUT_BUCKET`: S3 bucket for output data
- `BEDROCK_BATCH_ROLE_ARN`: IAM role ARN for batch jobs
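A sketch of how such settings are typically read, using the defaults from the list above (the variable names are the documented ones; the reading code itself is illustrative):

```python
# Sketch: resolve batch configuration from the environment with documented defaults.
import os

batch_size = int(os.getenv("BATCH_SIZE", "100"))        # max records per batch
batch_timeout = int(os.getenv("BATCH_TIMEOUT", "60"))   # minutes before a new batch
input_bucket = os.environ["BEDROCK_INPUT_BUCKET"]       # required: no sensible default
output_bucket = os.environ["BEDROCK_OUTPUT_BUCKET"]     # required
role_arn = os.environ["BEDROCK_BATCH_ROLE_ARN"]         # IAM role for the batch job
```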
The system includes robust error handling:
- Failed jobs are marked and preserved
- Partial results are processed when available
- Automatic retries for transient failures
- Detailed logging for debugging
Monitor batch processing status:
- Check `batch_data/*/metadata.json` files (used in the sketch after this list)
- View the AWS Bedrock console
- Monitor S3 buckets
- Check task result files
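Programmatically, a job's status can be checked with `boto3` given its ARN; a sketch (the `metadata.json` path and its `job_arn` key are assumptions about this repo's tracking files, not guaranteed field names):

```python
# Sketch: query a Bedrock batch job's status from a tracked ARN.
import json
import boto3

bedrock = boto3.client("bedrock")  # control-plane client (not bedrock-runtime)

with open("batch_data/submitted/example/metadata.json") as f:  # hypothetical path
    meta = json.load(f)

job = bedrock.get_model_invocation_job(jobIdentifier=meta["job_arn"])  # key name assumed
print(job["status"])  # e.g., Submitted, InProgress, Completed, Failed
```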
- Use appropriate batch sizes (100-1000 records)
- Monitor costs and completion times
- Regularly collect results
- Maintain S3 bucket lifecycle policies (a sketch follows this list)
- Review and archive completed batches
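One way to implement the lifecycle recommendation with `boto3` (the bucket name and 30-day retention are illustrative choices, not project defaults):

```python
# Sketch: expire old batch artifacts in an S3 bucket after 30 days.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bedrock-output-bucket",  # hypothetical bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-batch-results",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # apply to the whole bucket
                "Expiration": {"Days": 30},  # retention window; tune as needed
            }
        ]
    },
)
```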
Common issues and solutions:
- Missing Results: Check job status and S3 paths
- Failed Jobs: Review CloudWatch logs
- Stuck Jobs: Check IAM permissions
- S3 Errors: Verify bucket permissions
- Fork the repository
- Create a feature branch
- Submit a pull request
MIT License - See LICENSE file for details