A curated dataset of 500 websites annotated for the accuracy of their content (labels: Correct, Incorrect, Partially Correct), plus simple visualization and evaluation tooling.
## What's included
- `website_dataset.json`: the dataset (each entry includes `id`, `url`, `topic`, `label`, and `reasoning`).
- `visualize.html`: lightweight interactive charts for exploring label and topic distributions.
- `eval.py`: example evaluation harness that queries a web-enabled LLM and saves outputs to `model_behavior_outputs.json`.
- `model_behavior_outputs.json`: example outputs from a model evaluation run.
Each item in `website_dataset.json` contains:
- `id` (string): unique identifier
- `url` (string): source URL
- `topic` (string): subject/domain
- `label` (string): one of `Correct`, `Incorrect`, `Partially Correct`
- `reasoning` (string): human explanation for the assigned label
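As a quick sanity check, the dataset can be loaded and its label and topic distributions summarized in a few lines of Python. This sketch assumes `website_dataset.json` is a single JSON array of objects with the fields above:

```python
# Load the dataset and summarize it; assumes a top-level JSON array of entries.
import json
from collections import Counter

with open("website_dataset.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

print(f"{len(dataset)} entries")                                            # expected: 500
print("Labels:", Counter(item["label"] for item in dataset))
print("Top topics:", Counter(item["topic"] for item in dataset).most_common(5))
```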
The easiest way to view the charts is to serve the repository and open `visualize.html` in a browser (most browsers block local file access to JSON):
```bash
python -m http.server 8000
# then open http://localhost:8000/visualize.html
```

`eval.py` is an example script that:
- loads `website_dataset.json`
- constructs prompts for each item
- calls an LLM (via NVIDIA/Tavily tool bindings) to fetch web evidence and generate a response
- writes results to `model_behavior_outputs.json`
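For orientation, the sketch below shows one way such a loop could be structured with the `langchain_nvidia_ai_endpoints` and `langchain_tavily` bindings. It is not a copy of `eval.py`: the model name, prompt wording, search query, and output fields are illustrative assumptions, and for simplicity it calls the search tool directly and pastes the results into the prompt rather than binding the tool to the model.

```python
# Illustrative evaluation loop (NOT the exact contents of eval.py).
# Assumes NVIDIA_API_KEY and TAVILY_API_KEY are set in the environment.
import json

from langchain_nvidia_ai_endpoints import ChatNVIDIA  # NVIDIA-hosted chat models
from langchain_tavily import TavilySearch             # Tavily web search tool

llm = ChatNVIDIA(model="meta/llama-3.1-70b-instruct")  # model name is an assumption
search = TavilySearch(max_results=5)

with open("website_dataset.json", "r", encoding="utf-8") as f:
    dataset = json.load(f)

outputs = []
for item in dataset:
    # Gather web evidence about the page, then ask the model to judge its accuracy.
    evidence = search.invoke({"query": f"{item['topic']} {item['url']}"})
    prompt = (
        f"Assess the factual accuracy of the content at {item['url']} "
        f"(topic: {item['topic']}). Answer with Correct, Incorrect, or "
        f"Partially Correct and a short justification.\n\n"
        f"Web evidence:\n{json.dumps(evidence, default=str)}"
    )
    response = llm.invoke(prompt)
    outputs.append({"id": item["id"], "model_response": response.content})

with open("model_behavior_outputs.json", "w", encoding="utf-8") as f:
    json.dump(outputs, f, indent=2)
```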
Prerequisites and notes:
- The script expects two API keys as environment variables: `NVIDIA_API_KEY` and `TAVILY_API_KEY`.
- Install the required Python packages before running (`eval.py` uses the `langchain_core`, `langchain_nvidia_ai_endpoints`, and `langchain_tavily` bindings).
- Run the script with `python eval.py`.

Outputs from a run are saved to `model_behavior_outputs.json` in the repository root.
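For reference, a typical setup and invocation might look like this (the exported key values are placeholders; the package names correspond to the imports listed above):

```bash
pip install langchain-core langchain-nvidia-ai-endpoints langchain-tavily
export NVIDIA_API_KEY="your-nvidia-key"    # placeholder
export TAVILY_API_KEY="your-tavily-key"    # placeholder
python eval.py
```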
This work is licensed under the MIT License.