nazmus-ashrafi/multiagent_vs_debugger

Installation

To install the required dependencies, run:

pip install -r requirements.txt

Generation Command Example

Run the following command to generate results:

python main.py \
    --dataset humaneval \
    --signature \
    --provider_and_model openai:gpt-3.5-turbo-0125 \
    --flow basic \
    --range full \
    --output_path evaluation/basic_gpt35_turbo.jsonl

Evaluation Command Example

To evaluate functional correctness, use:

python evaluation/evaluate_functional_correctness.py \
    --problem_file evaluation/data/HumanEval.jsonl.gz \
    --sample_file evaluation/basic_gpt35_turbo.jsonl

Flow Options

Available flow options:

  • basic
  • AC
  • ACT
  • debugger
  • ac_debugger
  • act_debugger
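
To compare flows, the generation command can be repeated once per flow option. The sketch below only assembles the command strings, reusing the flags from the generation example above. It assumes each flow name is passed to --flow exactly as listed; the output file naming is hypothetical and not a convention of this repository.

```python
import shlex

# Flow names copied from the list above.
FLOWS = ["basic", "AC", "ACT", "debugger", "ac_debugger", "act_debugger"]


def build_generation_command(flow, model="openai:gpt-3.5-turbo-0125", dataset="humaneval"):
    """Return a shell-ready generation command for one flow.

    The flags mirror the README's generation example. The output path
    pattern is a hypothetical choice made for this illustration.
    """
    output_path = f"evaluation/{flow.lower()}_run.jsonl"  # hypothetical naming
    args = [
        "python", "main.py",
        "--dataset", dataset,
        "--signature",
        "--provider_and_model", model,
        "--flow", flow,
        "--range", "full",
        "--output_path", output_path,
    ]
    return shlex.join(args)


if __name__ == "__main__":
    for flow in FLOWS:
        print(build_generation_command(flow))
```

Printing the commands first, rather than executing them, makes it easy to check them before spending API credits.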

Dataset (problem_file) Options

Available datasets:

  • HumanEval.jsonl.gz
  • HumanEvalPlus.jsonl.gz
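
Both dataset files are gzip-compressed JSON Lines files, so they can be inspected with the Python standard library alone. The helper below is a minimal sketch for looking at a problem file before running an evaluation; `read_jsonl_gz` is a name introduced here for illustration and is not part of this repository. It makes no assumption about which fields each record contains.

```python
import gzip
import json


def read_jsonl_gz(path):
    """Read a gzip-compressed JSON Lines file into a list of records."""
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `len(read_jsonl_gz("evaluation/data/HumanEval.jsonl.gz"))` gives the number of problems in the file, using the path from the evaluation command above.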

LLM Options

The table below summarizes the LLMs used in this study.

Table summarizing the LLMs

Model endpoints for inference:

Hugging Face Endpoints

  1. HuggingFace:HuggingFaceH4/zephyr-7b-beta
  2. HuggingFace:Qwen/Qwen2.5-Coder-32B-Instruct
  3. HuggingFace:meta-llama/Meta-Llama-3-8B-Instruct
  4. HuggingFace:Qwen/QwQ-32B-Preview
  5. HuggingFace:microsoft/Phi-3.5-mini-instruct
  6. HuggingFace:mistralai/Mistral-7B-Instruct-v0.2

DeepSeek (Requires --api_key)

  1. deepseek:deepseek-chat

OpenAI

  1. openai:gpt-3.5-turbo-0125
  2. openai:gpt-4o-mini
  3. openai:gpt-4o

Anthropic (Requires --api_key)

  1. anthropic:claude-3-haiku-20240307
  2. anthropic:claude-3-5-sonnet-20241022
  3. anthropic:claude-3-5-haiku-20241022

Groq

  1. groq:llama-3.3-70b-versatile
  2. groq:llama-3.1-8b-instant
  3. groq:gemma2-9b-it
  4. groq:mixtral-8x7b-32768

Vertex

  1. vertex:gemini-2.0-flash-exp
  2. vertex:gemini-1.0-pro
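
Every endpoint above follows a provider:model pattern, and Hugging Face model names themselves contain a slash. The sketch below shows one way to split such a string and to flag the providers that the headings above mark as requiring --api_key. How main.py actually parses the value is not documented here, so both function names are hypothetical, and the key check reflects only what the headings state.

```python
# Providers whose headings above say "Requires --api_key".
# The README does not state how the other providers are authenticated.
API_KEY_PROVIDERS = {"deepseek", "anthropic"}


def split_provider_and_model(spec):
    """Split 'provider:model' on the first colon only."""
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model


def needs_api_key(spec):
    """Return True if the README marks this provider as requiring --api_key."""
    provider, _ = split_provider_and_model(spec)
    return provider.lower() in API_KEY_PROVIDERS
```

Splitting on the first colon keeps the rest of the string intact, so a value such as HuggingFace:Qwen/Qwen2.5-Coder-32B-Instruct yields the provider HuggingFace and the model Qwen/Qwen2.5-Coder-32B-Instruct.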

Acknowledgement

Our implementation adapts code from LDB and draws prompt ideas from both LDB and Self-collaboration Code Generation via ChatGPT. We thank the authors for their high-quality open-source code!

About

https://arxiv.org/abs/2505.02133 Enhancing LLM Code Generation: A Framework for Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency. Published in IEEE Xplore as part of the 2025 IEEE 19th International Conference on AICT.
