To install the required dependencies, run:
pip install -r requirements.txt

Run the following command to generate results:
python main.py --dataset humaneval --signature --provider_and_model openai:gpt-3.5-turbo-0125 --flow basic --range full --output_path evaluation/basic_gpt35_turbo.jsonl

To evaluate functional correctness, use:
python evaluation/evaluate_functional_correctness.py \
--problem_file evaluation/data/HumanEval.jsonl.gz \
--sample_file evaluation/basic_gpt35_turbo.jsonl

Available flow options:
basic
AC
ACT
debugger
ac_debugger
act_debugger
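
To generate results for every flow option, the command shown earlier can be assembled programmatically. The following is a minimal sketch, not part of the repository: `build_command` is a hypothetical helper, the flow names are taken from the list above, and the `evaluation/<flow>_<tag>.jsonl` naming pattern is an assumption extrapolated from the single example path in this README.

```python
import shlex

# Flow names as read from the list above.
FLOWS = ["basic", "AC", "ACT", "debugger", "ac_debugger", "act_debugger"]


def build_command(flow, provider_and_model="openai:gpt-3.5-turbo-0125", tag="gpt35_turbo"):
    """Return the main.py argument list for one flow (hypothetical helper)."""
    # Assumed naming pattern, modeled on evaluation/basic_gpt35_turbo.jsonl.
    output_path = f"evaluation/{flow}_{tag}.jsonl"
    return [
        "python", "main.py",
        "--dataset", "humaneval",
        "--signature",
        "--provider_and_model", provider_and_model,
        "--flow", flow,
        "--range", "full",
        "--output_path", output_path,
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for flow in FLOWS:
        print(shlex.join(build_command(flow)))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run` once the commands look right.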
Available datasets:
HumanEval.jsonl.gz
HumanEvalPlus.jsonl.gz
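
Both dataset files are gzip-compressed JSON Lines, so they can be inspected with the standard library alone. The sketch below assumes each record carries a `task_id` field, as in the published HumanEval format; `read_problems` is a hypothetical helper and not a function from this repository.

```python
import gzip
import json


def read_problems(path):
    """Load a .jsonl.gz problem file into a dict keyed by task_id."""
    problems = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines between records
            record = json.loads(line)
            # Assumption: every record has a "task_id" field.
            problems[record["task_id"]] = record
    return problems
```

For example, `read_problems("evaluation/data/HumanEval.jsonl.gz")` returns one entry per problem, which is handy for checking that a sample file covers every task.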
The table below summarizes the LLMs used in this study.
Model endpoints for inference:
HuggingFace:HuggingFaceH4/zephyr-7b-beta
HuggingFace:Qwen/Qwen2.5-Coder-32B-Instruct
HuggingFace:meta-llama/Meta-Llama-3-8B-Instruct
HuggingFace:Qwen/QwQ-32B-Preview
HuggingFace:microsoft/Phi-3.5-mini-instruct
HuggingFace:mistralai/Mistral-7B-Instruct-v0.2
deepseek:deepseek-chat
openai:gpt-3.5-turbo-0125
openai:gpt-4o-mini
openai:gpt-4o
anthropic:claude-3-haiku-20240307
anthropic:claude-3-5-sonnet-20241022
anthropic:claude-3-5-haiku-20241022
groq:llama-3.3-70b-versatile
groq:llama-3.1-8b-instant
groq:gemma2-9b-it
groq:mixtral-8x7b-32768
vertex:gemini-2.0-flash-exp
vertex:gemini-1.0-pro
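
Each endpoint above follows a `provider:model` pattern, which is the form passed to `--provider_and_model`. As an illustration only, the sketch below splits such a string on its first colon; `split_provider_model` is a hypothetical helper, and the repository's own parsing may differ.

```python
def split_provider_model(spec):
    """Split 'provider:model' on the first colon (hypothetical helper)."""
    provider, separator, model = spec.partition(":")
    if not separator or not provider or not model:
        raise ValueError(f"expected 'provider:model', got {spec!r}")
    return provider, model


if __name__ == "__main__":
    for spec in ("openai:gpt-4o-mini", "HuggingFace:Qwen/QwQ-32B-Preview"):
        print(split_provider_model(spec))
```

Splitting only on the first colon keeps HuggingFace repository IDs such as `Qwen/QwQ-32B-Preview` intact, since the slash stays inside the model part.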
Our implementation adapts code from LDB and prompt ideas from both LDB and Self-collaboration Code Generation via ChatGPT. We thank the authors for their high-quality open-source code!
