Add multilingual task suites by rakkit · Pull Request #2 · SDLAML/lighteval

rakkit · 2026-03-28T15:04:45Z

see TASK_NAMING_Multilingual.md

…tions

Refactor existing multilingual tasks and add new ones to match the English task conventions (BPB merged into CF, explicit few-shot counts, English explicitly excluded from all multilingual task registrations).

Refactored:
- mlmm_arc_challenge: cf+mcf variants, 26 langs, hf_revision pinned
- global_mmlu: cf+mcf variants, 33 langs (English removed), both formulations
- mlmm_hellaswag: cf only, 33 langs, few_shots_split=train added
- mgsm: :gen suffix, generation_size=512, both expr_gold + multilingual_quasi_em

New tasks:
- global_mmlu_lite: CohereLabs/Global-MMLU-Lite, 17 langs, cf+mcf
- mmlu_prox (multilingual): li-lab/MMLU-ProX, 28 langs, 10-option cf+mcf
- mmlu_prox (English): li-lab/MMLU-ProX English subset in tasks/tasks/
- wmt24pp: google/wmt24pp, 24 lang pairs × 2 directions, 0-shot gen

Tooling:
- scripts/multilingual_aggregate.py: cross-language average post-processor
- TASK_NAMING_Multilingual.md: naming conventions and language inventory doc

add comet22 metrics
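For readers unfamiliar with the idea behind the cross-language average post-processor listed under Tooling, here is a minimal sketch of an unweighted average over languages. It is not the contents of scripts/multilingual_aggregate.py: the function name `cross_language_average` and the `(task, language)` key layout are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import mean


def cross_language_average(scores: dict[tuple[str, str], float]) -> dict[str, float]:
    """Average per-language scores into one number per task.

    `scores` maps a hypothetical (task, language) pair to a metric value.
    """
    per_task: dict[str, list[float]] = defaultdict(list)
    for (task, _language), value in scores.items():
        per_task[task].append(value)
    # Unweighted (macro) mean: every language counts equally,
    # regardless of how many samples it has.
    return {task: mean(values) for task, values in per_task.items()}


# Illustrative numbers only, not real results.
scores = {
    ("global_mmlu:cf", "de"): 0.5,
    ("global_mmlu:cf", "fr"): 0.7,
    ("mgsm:gen", "de"): 0.2,
}
averages = cross_language_average(scores)
```

A macro average like this treats a language with few test items the same as one with many, which is usually what a cross-language summary wants; a sample-weighted mean would be a different design choice.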

ofivite · 2026-03-29T12:33:13Z

                "choices": [str(line["answer_number"])],
            },
        ),
        hf_repo="juletxara/mgsm",


Let's switch to https://huggingface.co/datasets/CohereLabs/global-mgsm

It has more languages, and the data may have been cleaned. Can you update accordingly? That means replacing line["answer_number"] with line["answer"], extending _LANGUAGES, and possibly something else I'm missing.
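One way to keep the dataset switch small is to read the gold answer through a single helper, so only the field name changes. The sketch below is an assumption-laden illustration: the helper name `gold_choices` is hypothetical, and the `answer` field name comes from the comment above and has not been checked against the dataset schema.

```python
def gold_choices(line: dict, answer_field: str = "answer") -> list[str]:
    """Return the gold answer as a one-element list of strings.

    `answer_field` is the column holding the numeric answer; the default
    follows the review comment and is an unverified assumption.
    """
    if answer_field not in line or line[answer_field] is None:
        raise KeyError(f"row has no usable {answer_field!r} field")
    # Cast to str so integer and string columns behave the same downstream.
    return [str(line[answer_field]).strip()]


# Old schema (explicit field name) and new schema (default field name).
old_row = {"question": "…", "answer_number": 18}
new_row = {"question": "…", "answer": "18"}
old_gold = gold_choices(old_row, answer_field="answer_number")
new_gold = gold_choices(new_row)
```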

ofivite · 2026-04-10T15:03:23Z

+        """
+        source = (doc.specific or {}).get("source_text", "")
+        golds = as_list(doc.get_golds())
+        return COMETCorpusMetricInput(source=source, hyp=model_response.final_text, ref=golds)


Handle the case when preds is a list (this is what the downstream typing assumes).

Suggested change

-        return COMETCorpusMetricInput(source=source, hyp=model_response.final_text, ref=golds)
+        preds = model_response.final_text
+        if len(preds) > 1:
+            logger.warning("Multiple predictions present, keeping only the first prediction (for COMET).")
+        return COMETCorpusMetricInput(source=source, hyp=preds[0], ref=golds)
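The suggestion indexes `preds[0]` directly, which assumes a non-empty list. A slightly more defensive variant is sketched below; the helper name `first_prediction` is hypothetical, and whether `final_text` can ever be a plain string or an empty list in practice is an assumption, not something confirmed by this PR.

```python
import logging

logger = logging.getLogger(__name__)


def first_prediction(final_text) -> str:
    """Reduce a model output (str or list of str) to a single hypothesis."""
    if isinstance(final_text, str):
        # Already a single hypothesis; indexing would return one character.
        return final_text
    preds = list(final_text)
    if not preds:
        # No generation at all: score an empty hypothesis instead of crashing.
        return ""
    if len(preds) > 1:
        logger.warning("Multiple predictions present, keeping only the first prediction (for COMET).")
    return preds[0]


single = first_prediction(["Hallo Welt"])
several = first_prediction(["Hallo Welt", "Hallo, Welt!"])
plain = first_prediction("Hallo Welt")
empty = first_prediction([])
```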

ofivite · 2026-04-10T16:09:31Z

            MultilingualQuasiExactMatchMetric(language, "full"),
        ],
-        stop_sequence=("\n",),
+        stop_sequence=["\n"],


I would remove "\n" because the model sometimes puts line breaks between reasoning steps.

Instead, put the equivalents of ["Question:", "Answer:"] for every language in _LANGUAGES, to stop generation when the model starts looping.
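To illustrate the difference between the two stop settings, the sketch below applies stop sequences to an already generated string. `truncate_at_stop` is a hypothetical helper written for this example, not part of lighteval, and the sample generation is made up.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        if not stop:
            continue  # an empty stop string would match at position 0
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generation = (
    "Step 1: 3 + 4 = 7\n"
    "Step 2: 7 * 2 = 14\n"
    "So the result is 14.\n\n"
    "Question: How many apples are left?"
)

# With "\n" the reasoning is cut after the first step.
with_newline = truncate_at_stop(generation, ["\n"])
# With the prompt markers the reasoning survives and only the loop is cut.
with_markers = truncate_at_stop(generation, ["Question:", "Answer:"])
```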

ofivite · 2026-04-10T16:20:55Z

-        generation_size=25,
+        generation_size=512,
        metrics=[
+            Metrics.expr_gold_metric,


This metric is English-only. When I tested it, it worked out of the box for only about half of the languages (I don't remember how I fixed it). It may or may not extract answers correctly, so we need to test it and fix it if needed.
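A per-language smoke test is one way to find out which languages the extraction fails on. The sketch below is only a harness: `extract_last_integer` is a stand-in extractor written for this example and is not `Metrics.expr_gold_metric`; the sample sentences are made up, and integer gold answers are assumed.

```python
import re
from typing import Callable, Optional


def extract_last_integer(text: str) -> Optional[str]:
    """Stand-in extractor: last run of decimal digits, normalised to ASCII."""
    # In Python's `re`, \d matches Unicode decimal digits (e.g. Bengali),
    # and int() accepts them, so str(int(...)) yields ASCII digits.
    matches = re.findall(r"\d+", text)
    return str(int(matches[-1])) if matches else None


def extraction_rate_by_language(
    samples: dict[str, list[tuple[str, str]]],
    extract: Callable[[str], Optional[str]],
) -> dict[str, float]:
    """Fraction of samples per language where the extractor returns the gold."""
    rates = {}
    for language, pairs in samples.items():
        hits = sum(1 for text, gold in pairs if extract(text) == gold)
        rates[language] = hits / len(pairs) if pairs else 0.0
    return rates


samples = {
    "en": [("The answer is 18.", "18")],
    "bn": [("উত্তর হল ১৮।", "18")],            # answer in Bengali digits
    "de": [("Die Antwort ist achtzehn.", "18")],  # number spelled out
}
rates = extraction_rate_by_language(samples, extract_last_integer)
failing = sorted(lang for lang, rate in rates.items() if rate < 1.0)
```

Swapping the stand-in for a wrapper around the real metric would show which languages in _LANGUAGES need a fix before the numbers can be trusted.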

ofivite · 2026-04-10T16:39:41Z

+            _arc_adapter,
            formulation=formulation,
        ),
        hf_repo="jon-tow/okapi_arc_challenge",


Maybe not for this PR, but in general we should extend arc and the other tasks to cover at least the core languages, and maybe the major ones too: https://huggingface.co/collections/Eurolingua/evaluation-suite

ofivite · 2026-04-10T16:49:45Z

+    TRANSLATION_LITERALS[_language].question_word = "question"
+    TRANSLATION_LITERALS[_language].answer = "answer"


Can you fix this to use proper translations instead of the English fallback? At least for Korean and Lithuanian.
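A per-language override table is one possible shape for this fix. The sketch below uses a stand-in `Literals` dataclass and plain string keys rather than the project's real `TRANSLATION_LITERALS` objects, so the structure is an assumption; the Korean and Lithuanian words are common translations of "question" and "answer" and should still be checked by a native speaker.

```python
from dataclasses import dataclass


@dataclass
class Literals:
    """Stand-in for a translation-literals entry (hypothetical)."""
    question_word: str = "question"  # English fallback
    answer: str = "answer"           # English fallback


# Hypothetical override table; extend with more languages as needed.
OVERRIDES = {
    "korean": {"question_word": "질문", "answer": "답"},
    "lithuanian": {"question_word": "klausimas", "answer": "atsakymas"},
}


def apply_overrides(literals: dict[str, Literals]) -> None:
    """Replace the English fallback where a proper translation is known."""
    for language, fields in OVERRIDES.items():
        if language not in literals:
            continue
        for name, value in fields.items():
            setattr(literals[language], name, value)


table = {"korean": Literals(), "lithuanian": Literals(), "other": Literals()}
apply_overrides(table)
```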

ofivite · 2026-04-10T17:06:42Z

+            hf_avail_splits=("dev", "devtest"),
+            evaluation_splits=("devtest",),
+            few_shots_split="dev",
+            few_shots_select="random_sampling_from_train",


Remove this line for clarity (there is no "train" split).

ofivite · 2026-04-14T10:42:28Z

also please add this fix: 9649aff

rakkit force-pushed the opt_moe_multilingual branch 7 times, most recently from 9811252 to 709ac24 (April 4, 2026 03:00)

rakkit force-pushed the opt_moe_multilingual branch from 709ac24 to 9cf5ef8 (April 4, 2026 14:57)

ofivite requested changes Apr 10, 2026

rakkit added 7 commits April 19, 2026 13:22

fix msmg

b6f8671

fix bugs

2ae0dbc

update ruler

7a2e52c

fix stuff

c627beb

fix pr#2 comments

9b1dc98

improve mgsm

d444a2c

fix bpb and cf for swarm

f30649b

rakkit wants to merge 8 commits into opt_moe from opt_moe_multilingual
