A research toolkit to generate, execute, and evaluate LLM-generated unit tests against the requests Python library.
This project supports:
- Generating unit tests using an LLM (Gemini) with different prompts.
- Evaluating syntactic, execution, and assertion correctness of generated tests.
- Measuring line and branch coverage against the local requests source code.
- Measuring the quality of generated tests regarding security aspects.
- Comparing LLM-generated tests against existing human-written tests.
Important design choice: The requests library itself is not vendored in this repository. You must provide a local clone and install it in editable mode.
git clone <repo-url>
cd evaluatingLLM
python3 -m venv venv
source venv/bin/activate
Verify:
which python
Output should point to:
.../evaluatingLLM/venv/bin/python
Install the core tooling used by the evaluation framework:
pip install --upgrade pip
pip install pytest coverage bandit ruff google-genai jupyterlab ipykernel pandas matplotlib seaborn plotly
Verify:
pip list
You should see (at least):
- pytest
- coverage
- bandit
- google-genai
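As an alternative to scanning the pip list output by eye, the check can be scripted. This is a minimal sketch, assuming the import names below match the installed distributions (google-genai is imported as google.genai); the helper name `missing_modules` is hypothetical and not part of this repository.

```python
from importlib import util


def missing_modules(names):
    """Return the module names from `names` that cannot be located."""
    missing = []
    for name in names:
        try:
            found = util.find_spec(name) is not None
        except ModuleNotFoundError:
            # Raised for dotted names whose parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed for the core tools listed above.
    required = ["pytest", "coverage", "bandit", "google.genai"]
    print("missing:", missing_modules(required))
```

An empty list means every listed module is importable from the active environment.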
If you want to use the notebook to plot analysis results from the CSV files, register the virtual environment as a Jupyter kernel and start JupyterLab from the repository root:
python -m ipykernel install --user --name evaluatingllm --display-name "evaluatingLLM (venv)"
jupyter lab notebooks/analysis.ipynb
Coverage and test execution must run against the local Requests source, not the published PyPI package. This allows:
- Accurate line & branch coverage
- Inspection of executed functions
- Fair comparison between human and LLM-generated tests
Clone Requests inside or next to this project:
git clone https://github.com/psf/requests.git
cd requests
pip install -e .
This creates an editable install, meaning import requests will resolve to your local clone.
Verify: pip list shows editable installs and their locations; confirm that requests points to your local clone.
For coverage to measure your local copy of requests, make sure the local Requests code is the one importable (editable install or PYTHONPATH pointing to the clone).
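The same verification can be done from Python by asking the import system where requests would be loaded from. This is a minimal sketch, assuming the clone lives in a requests/ directory under the project root; the helper names `module_origin` and `is_under` are hypothetical.

```python
from importlib import util
from pathlib import Path


def module_origin(name):
    """Return the resolved file a module would be imported from, or None."""
    spec = util.find_spec(name)
    if spec is None or not spec.origin or not Path(spec.origin).exists():
        return None
    return Path(spec.origin).resolve()


def is_under(path, root):
    """True if `path` lies inside the directory `root`."""
    return Path(root).resolve() in Path(path).resolve().parents


if __name__ == "__main__":
    clone_root = Path("requests")  # assumption: clone sits in the project root
    origin = module_origin("requests")
    if origin is None:
        print("requests is not importable in this environment")
    elif is_under(origin, clone_root):
        print("OK, requests resolves to the local clone:", origin)
    else:
        print("WARNING, requests resolves elsewhere:", origin)
```

If the warning branch fires, coverage would be measured against a different copy of the library than the one you cloned.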
IMPORTANT: Install the requests development dependencies to run the requests tests:
cd requests
pip install -r requirements-dev.txt
Avoid installing the published package:
pip install requests
This installs Requests into site-packages and breaks coverage measurement against local source code. Use it only if you are not measuring coverage.
This project uses the google-genai client to generate tests.
export GOOGLE_API_KEY="your_api_key_here"
(Optional) To make it persistent, add it to your shell config:
echo 'export GOOGLE_API_KEY="your_api_key_here"' >> ~/.zshrc
source ~/.zshrc
LLM-generated tests may execute arbitrary Python code, access the filesystem, and attempt network calls.
Docker provides filesystem isolation, optional network isolation, and a clean, reproducible Python environment.
Test generation (LLM calls) can be run locally.
Running test execution and evaluation in Docker is recommended.
- macOS, Linux, or Windows
- Docker Desktop installed and running
- Optionally, the example Dockerfile provided in the repository
Build the Docker Image (one-time) from the project root:
docker build -t evaluatingllm-eval -f docker/Dockerfile .
Make the Docker wrapper script executable:
chmod +x eval/scripts/run_in_docker.sh
Run any evaluation command via:
./eval/scripts/run_in_docker.sh <command>
All results are written to eval/results/ on your host system.
./eval/scripts/run_in_docker.sh python eval/scripts/evaluate_strategy_correctness.py --strategy P0 --csv
./eval/scripts/run_in_docker.sh python eval/scripts/evaluate_strategy_coverage.py --strategy P0 --csv
By default, Docker runs with no network access.
To enable network access explicitly:
./eval/scripts/run_in_docker.sh --net python eval/scripts/evaluate_strategy_correctness.py --strategy P0
- Your local Python virtual environment is not used inside Docker.
- The container installs the PyPI requests package to satisfy dependencies. The wrapper script sets PYTHONPATH to point to your local requests clone, so that copy is the one imported.
- Docker is the recommended execution mode for this project.
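To make the isolation behavior described above concrete, here is a sketch of how such a docker run invocation could be assembled. It is not the contents of run_in_docker.sh: the /workspace mount point, the PYTHONPATH value, and the function name `docker_command` are assumptions for illustration. Only the image name and the no-network default come from this README.

```python
from pathlib import Path


def docker_command(project_root, command, allow_network=False,
                   image="evaluatingllm-eval"):
    """Build a `docker run` argument list for one evaluation command."""
    args = ["docker", "run", "--rm"]
    if not allow_network:
        # Docker's "none" network mode disables networking in the container.
        args += ["--network", "none"]
    args += [
        # Assumed mount point; results written under it land on the host.
        "-v", f"{Path(project_root).resolve()}:/workspace",
        "-w", "/workspace",
        # Assumed clone location, so the local source shadows the PyPI copy.
        "-e", "PYTHONPATH=/workspace/requests/src",
        image,
    ]
    return args + list(command)


if __name__ == "__main__":
    print(" ".join(docker_command(".", ["python", "--version"])))
```

Passing `allow_network=True` simply omits the `--network none` pair, which mirrors the opt-in nature of the `--net` flag.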
- Build LLM prompts from eval/functions/functions_to_test.json and generate test files via Gemini (google-genai).
- Outputs are written to eval/tests/generated_tests/<STRATEGY>/<FUNCTION>/.
- Active virtualenv and basic tools installed (pytest/coverage are not required for generation).
- google-genai installed (the script imports google.genai).
- Set GEMINI_API_KEY in your environment (the script reads this to create a Gemini client).
- prompt-strategy: one of P0, P1, P2, P3 (required).
- model-name: Gemini model name (default: gemini-2.5-flash).
- function-name: if set, only generate for that function listed in functions_to_test.json.
- print-prompt: build and print the prompt only (no LLM call, no output files).
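The arguments above map naturally onto an argparse definition. The following is a sketch that mirrors the documented flags and defaults, not the actual parser in generate.py; the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser():
    """Argument parser mirroring the documented generation flags."""
    parser = argparse.ArgumentParser(description="Generate unit tests with an LLM")
    parser.add_argument("--prompt-strategy", required=True,
                        choices=["P0", "P1", "P2", "P3"])
    parser.add_argument("--model-name", default="gemini-2.5-flash")
    # When omitted, every function in functions_to_test.json is processed.
    parser.add_argument("--function-name", default=None)
    # Build and print the prompt only; no LLM call, no output files.
    parser.add_argument("--print-prompt", action="store_true")
    return parser


if __name__ == "__main__":
    example = ["--prompt-strategy", "P1", "--print-prompt"]
    print(build_parser().parse_args(example))
```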
# Generate P1 tests for all functions in functions_to_test.json:
python eval/scripts/generate.py --prompt-strategy P1
# Generate P2 (few-shot) tests for a single function:
python eval/scripts/generate.py --prompt-strategy P2 --function-name get_auth_from_url
# Print prompts for one function:
python eval/scripts/generate.py --prompt-strategy P1 --function-name parse_headers --print-prompt
- P3 is a multi-step self-refine flow (step1/2/3). The script prints step1 prompts when --print-prompt is used; full P3 requires live model calls.
- Be mindful of API quotas and token usage when running bulk generation.
- Prompts and function metadata come from eval/prompts/ and eval/functions/functions_to_test.json. Adjust those files to change what is generated.
- Run an interactive agent session where the LLM can autonomously read source code, navigate the repository, and generate test files.
- Ideal for complex strategies (like P3) or when you want the agent to explore the codebase before writing tests.
- Runs inside a secure Docker container (evaluatingllm-cli-2) with the Gemini CLI tool installed.
- Docker Desktop installed and running.
- GEMINI_API_KEY exported in your environment.
This is a separate image from the evaluation image. Build it once:
docker build -t evaluatingllm-cli-2 -f docker/GeminiCLI.Dockerfile .
Run the provided wrapper script:
./eval/scripts/run_gemini_cli.sh
This will:
- Mount your current repository to /workspace.
- Mount a dummy volume over /workspace/venv to protect your local environment.
- Drop you into an interactive gemini shell session.
Once the CLI starts, it waits for input. You should provide one of the pre-defined strategy prompts located in eval/prompts/CLI/.
Workflow:
- Open a prompt file locally (e.g., eval/prompts/CLI/CLI_P0.txt).
- Copy the entire content.
- Paste it into the running Gemini CLI session.
- The agent will parse the instructions and begin the test generation task for the target functions.
- Check generated test files for syntax, count asserts, and run pytest on each file.
- Classify results (pass / assertion failure / execution error).
- Optionally write a CSV summary.
- Human-readable table printed to stdout.
- strategy : required, one of P0,P1,P2,P3
- tests : optional, function name for the tests (function/class folder inside the strategy dir). If omitted the whole strategy folder is evaluated.
- csv : optional flag. When provided, the script creates a CSV at eval/results/correctness/<strategy>/<tests_or_strategy>/results_<tests_or_strategy>.cs
# Run on an entire strategy:
python eval/scripts/evaluate_strategy_correctness.py --strategy P0 --csv
# Run on a single function folder inside a strategy:
python eval/scripts/evaluate_strategy_correctness.py --strategy P0 --tests get_auth_from_url --csv
- Run line+branch coverage for a group of tests.
- The script creates a temporary or persistent .coverage data file and runs:
  coverage run --branch --data-file=<data_file> --source=<sut_root> -m pytest <tests>
  If --requests-functions is used, the script passes explicit pytest nodeids (built by build_requests_functions_pytest_args()) instead of a test directory.
- It is recommended to run this script once per strategy folder. The script is not designed to measure coverage for a single test file.
- Optionally outputs a CSV file where one row has measurements for the whole test folder.
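The coverage invocation described above can be assembled as an argument list for subprocess. This sketch only reproduces the documented command line; the function name `coverage_command` and the example paths are illustrative assumptions, and `python -m coverage` is used as the equivalent of the `coverage` executable.

```python
import sys


def coverage_command(data_file, sut_root, pytest_args):
    """Build the `coverage run` argument list for a group of tests."""
    return [
        sys.executable, "-m", "coverage", "run",
        "--branch",                      # measure branch coverage as well as lines
        f"--data-file={data_file}",      # where the .coverage data is written
        f"--source={sut_root}",          # restrict measurement to the SUT
        "-m", "pytest", *pytest_args,    # a test directory or explicit nodeids
    ]


if __name__ == "__main__":
    cmd = coverage_command(".coverage.P0", "requests/src/requests",
                           ["eval/tests/generated_tests/P0"])
    print(" ".join(cmd))
    # To execute: subprocess.run(cmd, check=False)
```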
- strategy (required unless --requests-all or --requests-functions is used) : e.g. P0, P1, ...
- tests (optional) : tests folder under the strategy to run (e.g. get_auth_from_url)
- requests-all (flag) : runs all the tests under requests repo at <project_root>/requests/tests
- requests-functions (flag) : runs only the requests library test cases listed in functions_to_test.json, for a fair coverage comparison between LLM-generated tests and official requests tests.
- sut-root (optional) : path to SUT root, default requests/src/requests
- label (optional): friendly label used in CSV/JSON filenames
- csv (flag) : write CSV to eval/results/coverage/<strategy_or_requests>/coverage_results_.csv
- json-dir (flag) : keep coverage JSON/.coverage under eval/results/coverage/<strategy_or_requests>/json_results/
# Coverage for LLM-generated P0 test cases for a specific requests function called _basic_auth_str
python ./eval/scripts/evaluate_strategy_coverage.py --strategy P0 --tests _basic_auth_str --csv --json-dir
# Coverage for all LLM-generated P0 tests
python ./eval/scripts/evaluate_strategy_coverage.py --strategy P0 --csv
# Coverage for all requests tests (some are not relevant for the project)
python ./eval/scripts/evaluate_strategy_coverage.py --requests-all --csv
# Coverage for all relevant requests functions
python ./eval/scripts/evaluate_strategy_coverage.py --requests-functions --csv --json-dir
# Coverage for one specific relevant function from requests
python ./eval/scripts/evaluate_strategy_coverage.py --requests-functions --name get_auth_from_url --csv --json-dir
- Keep functions_to_test.json nodeids accurate (project-root relative).
- Use editable install (pip install -e /path/to/requests) or set PYTHONPATH so the source under test is the importable requests package.
- Use --json-dir when you want the JSON and .coverage artifacts for deeper debugging or to build an HTML coverage report later.
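When inspecting the kept JSON artifacts, line and branch percentages can be recomputed from the report totals. This sketch assumes the JSON follows the layout produced by `coverage json` (a top-level "totals" object with covered_lines, num_statements, covered_branches, and num_branches); the function name `summarize_totals` is hypothetical.

```python
def summarize_totals(report):
    """Compute line and branch coverage percentages from a coverage JSON report."""
    totals = report["totals"]
    lines = totals["num_statements"]
    branches = totals.get("num_branches", 0)
    return {
        # An empty denominator is reported as fully covered.
        "line_pct": 100.0 * totals["covered_lines"] / lines if lines else 100.0,
        "branch_pct": (100.0 * totals.get("covered_branches", 0) / branches
                       if branches else 100.0),
    }


if __name__ == "__main__":
    sample = {"totals": {"covered_lines": 30, "num_statements": 40,
                         "covered_branches": 5, "num_branches": 10}}
    print(summarize_totals(sample))
```

In practice the dictionary would come from `json.load()` on a file under the json_results/ directory.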
Note: No automated script is provided for this step.
- Tool: Snyk CLI (specifically snyk code for SAST).
- Execution: Run Snyk against the generated test files and output to JSON.
snyk code test eval/tests/generated_tests/<STRATEGY>/<FUNCTION>/test_file.py --json > output.json
- Analysis:
- The JSON results were converted to CSV format (via an ad-hoc script, not included).
- Results are analyzed and visualized using the security_analysis.ipynb notebook in the project root.
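Since the conversion script is not included, the following sketch shows one way the JSON-to-CSV step could look. It is not the script used in this project. It assumes the Snyk output is SARIF-shaped (runs, results, ruleId, level, message.text, and a physicalLocation per result); the column names and function names are hypothetical.

```python
import csv
import io

FIELDS = ["rule_id", "level", "file", "line", "message"]


def sarif_rows(report):
    """Flatten SARIF-style results into one dict per finding."""
    rows = []
    for run in report.get("runs", []):
        for result in run.get("results", []):
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            rows.append({
                "rule_id": result.get("ruleId", ""),
                "level": result.get("level", ""),
                "file": physical.get("artifactLocation", {}).get("uri", ""),
                "line": physical.get("region", {}).get("startLine", ""),
                "message": result.get("message", {}).get("text", ""),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the flattened findings as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

A typical use would be `rows_to_csv(sarif_rows(json.load(open("output.json"))))`, with the result written next to the other analysis inputs.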
- All tried-out scenarios can be found under eval/finak_approaches. The exact prompts used and the generated tests are stored in each scenario folder.
- venv/, .coverage*, htmlcov/, and result artifacts should be ignored via .gitignore
- The Requests repo should not be committed inside this repository
- The scripts assume macOS shell paths in examples; adjust activate/unlink commands for other shells/OSes.
- Keep functions_to_test.json nodeids accurate (path portion must point to existing test files). build_requests_functions_pytest_args() resolves/validates nodeid file paths and will raise if the file does not exist.