A comprehensive collection of multilingual benchmarks for evaluating language models, covering both generative and discriminative tasks across dozens of languages.
Introduction
The field of multilingual NLP has experienced explosive growth, particularly with the rise of LLMs. As models like mBERT, XLM-R, LLaMA, Qwen, and Gemini claim multilingual capabilities, robust evaluation benchmarks become essential for measuring progress and identifying gaps.
This comprehensive guide attempts to index the major multilingual benchmarks available for both discriminative tasks (classification, tagging) and generative tasks (question answering, summarization, translation). Each entry includes paper references, dataset links, and key characteristics.
Why Multilingual Benchmarks Matter
According to recent research analyzing more than 2,000 multilingual benchmarks published between 2021 and 2024, English still dominates evaluation despite significant investment in multilingual evaluation.[1] Low-resource languages remain severely underrepresented, and there is a notable gap between benchmark scores and real-world human judgments, particularly for traditional NLP tasks.
On a more practical note, having all these benchmarks archived in one place is of great value. I wish I had a resource like this when I first started working on multilingual NLP; it would have saved countless hours of scattered searching. Multilingual evaluation is at the heart of building better multilingual models, and seeing these benchmarks together offers a chance to reflect on what we have, identify the gaps that remain, and plot a path forward.
Meta-Benchmarks and Evaluation Suites
These comprehensive benchmarks combine multiple tasks to provide holistic evaluation of multilingual models.
XTREME
Paper
Hu et al. - "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization"
Combines nine tasks spanning sentence classification, structured prediction, question answering, and sentence retrieval across 40 typologically diverse languages. The standard protocol fine-tunes on English training data and measures zero-shot cross-lingual transfer to the remaining languages.
PAWS-X
Paper
Yang et al. - "PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification"
Adversarial paraphrase detection requiring models to understand word order and syntax, not just lexical overlap. 49,401 training pairs + 2,000 dev/test per language.
Belebele
Paper
Bandarkar et al. - "The Belebele Benchmark: A Parallel Reading Comprehension Dataset in 122 Language Variants"
Built on FLORES-200 passages, with fully parallel questions and answers so that the same items are asked in every language and difficulty is directly comparable. The most extensive parallel reading comprehension benchmark to date: 900 questions per language, each with four multiple-choice answers.
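Since the format is identical in every language, scoring comes down to plain multiple-choice accuracy. Here is a minimal sketch of loading one Belebele language variant from the Hugging Face Hub and scoring a model on it; the dataset ID and field names follow the public release as I understand it, and `pick_answer` is a stand-in for a real model call.

```python
# Minimal sketch: score a (placeholder) answer-selection function on Belebele.
# Dataset ID and field names ("flores_passage", "question", "mc_answer1"..."mc_answer4",
# "correct_answer_num") are based on the public Hugging Face release and may need adjusting.
from datasets import load_dataset

def pick_answer(passage: str, question: str, options: list[str]) -> int:
    """Placeholder model: always guesses the first option. Swap in a real model call."""
    return 0

ds = load_dataset("facebook/belebele", "eng_Latn", split="test")

correct = 0
for row in ds:
    options = [row[f"mc_answer{i}"] for i in range(1, 5)]
    pred = pick_answer(row["flores_passage"], row["question"], options)
    correct += int(pred == int(row["correct_answer_num"]) - 1)  # gold index is 1-based

print(f"accuracy: {correct / len(ds):.3f}")
```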
Question Answering Benchmarks
MLQA
Paper
Lewis et al. - "MLQA: Evaluating Cross-lingual Extractive Question Answering"
Highly parallel dataset allowing cross-lingual QA evaluation. QA instances parallel between 4 languages on average. 5K+ extractive QA instances per language (12K in English).
XQuAD
Paper
Artetxe et al. - "On the Cross-lingual Transferability of Monolingual Representations"
Professional human translations of a SQuAD v1.1 subset. Good for evaluating zero-shot cross-lingual transfer. 240 paragraphs and 1,190 question-answer pairs per language.
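XQuAD and MLQA are scored with SQuAD-style exact match and token-level F1. The sketch below captures the core of both metrics under simplifying assumptions (lowercasing and whitespace tokenization only); the official evaluation scripts add answer normalization and, for MLQA, language-specific rules.

```python
# Minimal sketch of SQuAD-style metrics used for XQuAD/MLQA. Real evaluation scripts
# also strip punctuation and articles and apply language-specific normalization;
# this version only lowercases and splits on whitespace.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Eiffel Tower", "Eiffel Tower"))  # 0.8
```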
TyDi QA
Paper
Clark et al. - "TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages"
Questions written by native speakers who don't know the answer, avoiding priming effects. No translation used; data is collected directly in each language. 204K question-answer pairs.
MKQA
Paper
Longpre et al. - "MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering"
10,000 question-answer pairs sampled from Natural Questions and professionally aligned across 26 typologically diverse languages (260K examples in total). Answers are retrieval-independent, enabling comparable open-domain QA evaluation across languages.
Named Entity Recognition Benchmarks
WikiANN
Paper
Pan et al. - "Cross-lingual Name Tagging and Linking for 282 Languages"
Automatically annotated from Wikipedia using cross-lingual links. While broad in coverage, automatic annotation introduces quality issues for some languages.
Universal NER (UNER)
Paper
Mayhew et al. - "Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark"
Gold-standard, human-annotated named entity data in a single cross-lingually consistent schema, layered on top of Universal Dependencies treebanks across typologically diverse languages. Complements the automatically derived WikiANN with higher-quality annotations.
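NER benchmarks like UNER are usually scored with entity-level precision, recall, and F1 over BIO-tagged spans. A minimal sketch using the seqeval library follows; the library choice and the toy tag sequences are illustrative assumptions, not something UNER prescribes.

```python
# Minimal sketch of entity-level F1 over BIO tags, the usual metric for NER benchmarks
# such as UNER. Uses the seqeval library; the tag sequences below are invented examples.
from seqeval.metrics import classification_report, f1_score

gold = [["B-PER", "I-PER", "O", "B-LOC"],
        ["O", "B-ORG", "O"]]
pred = [["B-PER", "I-PER", "O", "B-LOC"],
        ["O", "B-LOC", "O"]]

print(f1_score(gold, pred))               # span-level F1 across all entity types
print(classification_report(gold, pred))  # per-type precision/recall/F1
```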
Reasoning, Classification, and Knowledge Benchmarks
MGSM
Paper
Shi et al. - "Language Models are Multilingual Chain-of-Thought Reasoners"
250 grade-school math word problems from GSM8K, manually translated into ten typologically diverse languages including Bengali, Telugu, Thai, and Swahili. Designed to probe multilingual chain-of-thought reasoning.
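Scoring MGSM typically means extracting the final number from the model's generated reasoning and exact-matching it against the gold answer. The sketch below illustrates that pattern; the regex and normalization are a simplification of what real evaluation harnesses do.

```python
# Minimal sketch of MGSM-style scoring: pull the last number out of a generated
# chain-of-thought and exact-match it against the gold answer. The regex and
# normalization are simplifications of what official harnesses do.
import re

def extract_final_number(generation: str) -> str | None:
    numbers = re.findall(r"-?\d[\d,]*\.?\d*", generation)
    return numbers[-1].replace(",", "") if numbers else None

def is_correct(generation: str, gold_answer: str) -> bool:
    pred = extract_final_number(generation)
    return pred is not None and float(pred) == float(gold_answer)

print(is_correct("Sie hat 3 + 4 = 7 Äpfel. Die Antwort ist 7.", "7"))  # True
```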
Taxi1500
Paper
Ma et al. (2023) - "Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages"
Sentence-level text classification over Bible verses from a massively parallel corpus, covering more than 1,500 languages with six topic classes; labels annotated in English are projected to the other languages through verse-level alignment.
Global-MMLU
Paper
Singh et al. - "Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation"
A culturally aware multilingual extension of MMLU covering 42 languages, combining professional, community, and post-edited machine translations, with each question tagged as culturally sensitive or culturally agnostic so the two subsets can be analyzed separately.
Conclusion
This guide has covered the major multilingual benchmarks available at the time of writing. Several key takeaways stand out:
1. Coverage is improving but uneven: While benchmarks like FLORES-200 and SIB-200 cover 200+ languages, most benchmarks focus on 10-50 languages, with English, Chinese, Spanish, French, and German being most represented.
2. Task diversity: The field has moved beyond simple classification to include reasoning (XCOPA, MGSM), knowledge evaluation (MMLU variants), and complex QA (TyDi QA).
3. The low-resource gap: Performance on low-resource languages consistently lags behind high-resource languages, highlighting the need for continued investment in underrepresented languages.
Last updated: January 2025
Have a benchmark to add? This list is not exhaustive; the multilingual NLP community continues to create new evaluation resources. Consider contributing to this list through GitHub.
References
^ Wu, M., Wang, W., Liu, S., Yin, H., Wang, X., Zhao, Y., Lyu, C., Wang, L., Luo, W., & Zhang, K. (2025). "The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks." arXiv:2504.15521