andyzorigin/data_contamination
Usage:
python compute_contamination_metrics.py --input-data <input_data> --scenario-data <scenario_data> --output-stats <output_stats> --input-format <input_format>
For instance, to run this on The Pile, you could use:
input_data = 00.jsonl (downloaded from https://pile.eleuther.ai/)
scenario_data = (an example is included with the repo; HELM can also generate one)
output_stats = an arbitrary output file name, e.g. "output_stats"
input_format = the_pile
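
To sanity-check an input file before running the script, the sketch below reads Pile-style JSONL, where each line is a JSON object with a "text" field. It is only an illustration: the helper name iter_pile_documents is hypothetical and not part of this repo, and it assumes that the_pile input format means this one-object-per-line layout.

```python
import json
from typing import Iterable, Iterator


def iter_pile_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield the "text" field of each non-empty JSONL line.

    Hypothetical helper, not part of this repo. Assumes every record
    is a JSON object with a "text" key, as in The Pile's .jsonl shards.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record["text"]


# In-memory sample standing in for the first lines of a shard such as 00.jsonl.
sample = [
    '{"text": "first document", "meta": {"pile_set_name": "Pile-CC"}}',
    "",
    '{"text": "second document", "meta": {}}',
]
documents = list(iter_pile_documents(sample))
```

For a real shard, pass an open file handle instead of the sample list, e.g. iter_pile_documents(open("00.jsonl", encoding="utf-8")), since iterating a file handle yields one line at a time.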