Add multilingual task suites #2
9811252 to 709ac24
…tions

Refactor existing multilingual tasks and add new ones to match the English task conventions (BPB merged into CF, explicit few-shot counts, English explicitly excluded from all multilingual task registrations).

Refactored:
- mlmm_arc_challenge: cf+mcf variants, 26 langs, hf_revision pinned
- global_mmlu: cf+mcf variants, 33 langs (English removed), both formulations
- mlmm_hellaswag: cf only, 33 langs, few_shots_split=train added
- mgsm: :gen suffix, generation_size=512, both expr_gold + multilingual_quasi_em

New tasks:
- global_mmlu_lite: CohereLabs/Global-MMLU-Lite, 17 langs, cf+mcf
- mmlu_prox (multilingual): li-lab/MMLU-ProX, 28 langs, 10-option cf+mcf
- mmlu_prox (English): li-lab/MMLU-ProX English subset in tasks/tasks/
- wmt24pp: google/wmt24pp, 24 lang pairs × 2 directions, 0-shot gen

Tooling:
- scripts/multilingual_aggregate.py: cross-language average post-processor
- TASK_NAMING_Multilingual.md: naming conventions and language inventory doc

add comet22 metrics
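For readers unfamiliar with the "cross-language average" mentioned under Tooling, here is a minimal sketch of one way such a post-processor could work. It is not the contents of scripts/multilingual_aggregate.py: the function name and the `(task, language)` key layout are assumptions made for illustration, and the sketch computes a plain unweighted mean over languages.

```python
from collections import defaultdict
from statistics import mean


def cross_language_average(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average per-language scores into one score per task.

    `scores` maps a hypothetical (task, language) pair to a metric value.
    Every language counts equally, regardless of its number of samples.
    """
    by_task: dict[str, list[float]] = defaultdict(list)
    for (task, _language), value in scores.items():
        by_task[task].append(value)
    return {task: mean(values) for task, values in by_task.items()}


example = {
    ("mgsm", "de"): 0.5,
    ("mgsm", "fr"): 0.7,
    ("mlmm_arc_challenge", "de"): 0.4,
}
averages = cross_language_average(example)
```

An unweighted mean treats low-resource and high-resource languages alike; weighting by sample count would be a different design choice.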
709ac24 to 9cf5ef8
| "choices": [str(line["answer_number"])], | ||
| }, | ||
| ), | ||
| hf_repo="juletxara/mgsm", |
Let's switch to https://huggingface.co/datasets/CohereLabs/global-mgsm
It has more languages, and they may have cleaned it. Can you update accordingly? That means replacing line["answer_number"] with line["answer"], extending _LANGUAGES, and maybe something else I'm missing.
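A minimal sketch of the adapter change requested above, assuming the new dataset stores the gold value under `answer` as the comment suggests (the field name has not been checked against the dataset card here, and `gold_choices` is a hypothetical helper, not code from the PR):

```python
def gold_choices(line: dict) -> list[str]:
    """Return the gold answer as a single-element list of strings.

    Assumption: CohereLabs/global-mgsm uses the key "answer", while
    juletxara/mgsm used "answer_number". Both keys are tried so the
    adapter keeps working while the switch is in progress.
    """
    for key in ("answer", "answer_number"):
        if line.get(key) is not None:
            return [str(line[key])]
    raise KeyError("no gold answer field found in dataset row")
```

Once the switch is complete, the fallback to `answer_number` can be dropped.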
| """ | ||
| source = (doc.specific or {}).get("source_text", "") | ||
| golds = as_list(doc.get_golds()) | ||
| return COMETCorpusMetricInput(source=source, hyp=model_response.final_text, ref=golds) |
Handle the case where preds is a list (this is what the downstream typing assumes).
| return COMETCorpusMetricInput(source=source, hyp=model_response.final_text, ref=golds) | |
| preds = model_response.final_text | |
| if len(preds) > 1: | |
| logger.warning("Multiple predictions present, keeping only the first prediction (for COMET).") | |
| return COMETCorpusMetricInput(source=source, hyp=preds[0], ref=golds) |
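One caveat with the suggestion above: if `final_text` ever arrives as a plain string rather than a list, `len(preds) > 1` is true for any string longer than one character, and `preds[0]` keeps only its first character. A defensive sketch that accepts both shapes follows; `first_prediction` is a hypothetical helper name, and whether `final_text` can actually be a string is an assumption to check against the library's types.

```python
import logging

logger = logging.getLogger(__name__)


def first_prediction(preds) -> str:
    """Normalize a model response to a single hypothesis string."""
    if isinstance(preds, str):
        # A bare string is already one prediction; do not index into it.
        return preds
    preds = list(preds)
    if not preds:
        return ""
    if len(preds) > 1:
        logger.warning("Multiple predictions present, keeping only the first prediction (for COMET).")
    return preds[0]
```

The result would then be passed as `hyp` in place of `preds[0]`.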
| MultilingualQuasiExactMatchMetric(language, "full"), | ||
| ], | ||
| stop_sequence=("\n",), | ||
| stop_sequence=["\n"], |
I would remove "\n" because the model sometimes inserts line breaks between reasoning steps.
Instead, put ["Question:", "Answer:"] in all _LANGUAGES to stop generation when the model starts looping.
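The difference between the two stop-sequence choices can be illustrated with a small sketch. `truncate_at_stop` is a hypothetical stand-in that emulates stopping at the earliest stop sequence; the actual inference backend may apply stop sequences differently.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Hypothetical generation: two lines of reasoning, then the model starts looping.
generation = "5 + 3 = 8\nSo the total is 8.\nQuestion: What is 2 + 2?"

with_newline_stop = truncate_at_stop(generation, ["\n"])
with_marker_stops = truncate_at_stop(generation, ["Question:", "Answer:"])
```

With "\n" as the stop sequence, only the first reasoning line survives; with the marker stops, the full reasoning is kept and only the looped question is cut.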
| generation_size=25, | ||
| generation_size=512, | ||
| metrics=[ | ||
| Metrics.expr_gold_metric, |
This metric is English-only. When I tested it, it worked out of the box for only about half of the languages (I don't remember how I fixed it). It may or may not extract answers correctly, so we need to test it and fix it if needed.
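One way to run the per-language check requested above is a small smoke test that reports which languages an extractor fails on. This is a sketch under assumptions: `last_number` is a deliberately naive stand-in extractor, not `expr_gold_metric`, and the sample sentences are illustrative rather than taken from the dataset.

```python
import re
from typing import Callable, Optional


def failing_languages(
    extract: Callable[[str], Optional[str]],
    samples: dict[str, list[tuple[str, str]]],
) -> dict[str, list[str]]:
    """Return, per language, the texts whose extracted answer differs from gold."""
    failures: dict[str, list[str]] = {}
    for language, cases in samples.items():
        for text, gold in cases:
            if extract(text) != gold:
                failures.setdefault(language, []).append(text)
    return failures


def last_number(text: str) -> Optional[str]:
    """Naive stand-in extractor: return the last number in the text, verbatim."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


samples = {
    "en": [("The answer is 18.", "18")],
    "de": [("Die Antwort ist 18.", "18")],
    "bn": [("উত্তর: ১৮", "18")],  # Bengali digits for 18
}
report = failing_languages(last_number, samples)
```

Here the stand-in fails only on Bengali, because it returns the native digits without normalizing them to ASCII. Swapping in the real metric's extraction step would show which languages need a fix.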
| _arc_adapter, | ||
| formulation=formulation, | ||
| ), | ||
| hf_repo="jon-tow/okapi_arc_challenge", |
Maybe not for this PR, but in general we should extend arc and other tasks to cover at least the core languages, and maybe the major ones too: https://huggingface.co/collections/Eurolingua/evaluation-suite
| TRANSLATION_LITERALS[_language].question_word = "question" | ||
| TRANSLATION_LITERALS[_language].answer = "answer" |
Can you fix this to use proper translations instead of the English fallback? At least for Korean and Lithuanian.
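A possible starting point for the two languages named above, as a sketch only. `Literals` is a hypothetical stand-in for the project's translation-literal type, the language keys are assumed, and the wordings should be confirmed by native speakers before merging.

```python
from dataclasses import dataclass


@dataclass
class Literals:
    """Hypothetical stand-in holding only the two fields set in the diff."""
    question_word: str
    answer: str


# Proposed replacements for the English fallback; to be reviewed by native speakers.
PROPOSED = {
    "ko": Literals(question_word="질문", answer="답"),
    "lt": Literals(question_word="klausimas", answer="atsakymas"),
}
```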
| hf_avail_splits=("dev", "devtest"), | ||
| evaluation_splits=("devtest",), | ||
| few_shots_split="dev", | ||
| few_shots_select="random_sampling_from_train", |
Remove this line for clarity (there is no "train" split).
Also, please add this fix: 9649aff
see TASK_NAMING_Multilingual.md