python3 -m venv .venv
source .venv/bin/activate
# For GPU runs, prefer a CUDA-enabled PyTorch wheel:
pip install torch --index-url https://download.pytorch.org/whl/cu124
pip install -r requirements.txtGPU check:
python -c "import torch; print('cuda_available=', torch.cuda.is_available()); print('torch_cuda=', torch.version.cuda)"Optional (only if you still want Ollama backend):
ollama pull gemma4:e2bSmoke test (5 samples):
Hugging Face backend (default):
python run_pubmedqa_eval.py --backend hf --hf_model_id google/gemma-2-2b-it --hf_device auto --config pqa_labeled --split train --limit 5Ollama backend:
python run_pubmedqa_eval.py --backend ollama --model gemma4:e2b --config pqa_labeled --split train --limit 5Larger run (4000 samples):
python run_pubmedqa_eval.py --backend hf --hf_model_id google/gemma-2-2b-it --hf_device auto --config pqa_artificial --split train --limit 4000Optional custom output path:
python run_pubmedqa_eval.py --limit 5 --output_csv outputs/my_predictions.csvpython analyze_results.py --input_csv outputs/my_predictions.csv --output_dir outputs --device autoArtifacts:
outputs/confusion_matrix.csvoutputs/per_sample_scores.csvoutputs/analysis_summary.json
- Prompt includes both
questionandcontext. final_decisionis normalized toyes/no/maybe.long_answeris compared using ROUGE-L, BERTScore, and embedding distance.