Multilingual NLP Benchmarks

A comprehensive collection of multilingual benchmarks for evaluating language models, covering both generative and discriminative tasks across dozens of languages.


Introduction

The field of multilingual NLP has experienced explosive growth, particularly with the rise of LLMs. As models like mBERT, XLM-R, LLaMA, Qwen, and Gemini claim multilingual capabilities, robust evaluation benchmarks become essential for measuring progress and identifying gaps.

This comprehensive guide attempts to index the major multilingual benchmarks available for both discriminative tasks (classification, tagging) and generative tasks (question answering, summarization, translation). Each entry includes paper references, dataset links, and key characteristics.

Why Multilingual Benchmarks Matter

According to recent research analyzing over 2,000 multilingual benchmarks published between 2021 and 2024, English still dominates evaluation despite significant investment in multilingual evaluation.[1] Low-resource languages remain severely underrepresented, and there is a notable gap between benchmark performance and real-world human judgments, particularly for traditional NLP tasks.

On a more practical note, having all these benchmarks archived in one place is of great value. I wish I had a resource like this when I first started working on multilingual NLP; it would have saved countless hours of scattered searching. Multilingual evaluation is at the heart of building better multilingual models, and seeing all these benchmarks together offers a chance to reflect on what we have, identify the gaps that remain, and plot a path forward.

Meta-Benchmarks and Evaluation Suites

These comprehensive benchmarks combine multiple tasks to provide holistic evaluation of multilingual models.

XTREME

Paper: Hu et al. - "XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization"
Conference: ICML 2020
arXiv: 2003.11080
Languages: 40 typologically diverse languages
Tasks: 9 tasks across 4 categories: Classification (XNLI, PAWS-X), Structured Prediction (POS, NER), QA (XQuAD, MLQA, TyDiQA), Retrieval (Tatoeba, BUCC)
Dataset: huggingface.co/datasets/google/xtreme
Type: Discriminative
Description: The first major comprehensive benchmark for evaluating cross-lingual transfer. Uses zero-shot transfer from English for evaluation.
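
In practice, the XTREME protocol comes down to a single loop: fine-tune on the English training split of a task, then score the same checkpoint on every target-language test set. The sketch below illustrates this with XNLI loaded from the Hugging Face Hub; finetune_on_english and predict are hypothetical placeholders for your own training and inference code, and the dataset, config, and split names should be verified against the dataset card.

from datasets import load_dataset

# Hypothetical helpers -- plug in your own training and inference code.
def finetune_on_english(model, english_train_set):
    ...

def predict(model, example):
    ...

def zero_shot_transfer_eval(model, languages=("de", "es", "sw", "th", "zh")):
    """XTREME-style zero-shot transfer: English supervision only, test everywhere."""
    train_en = load_dataset("xnli", "en", split="train")   # English training data
    model = finetune_on_english(model, train_en)
    scores = {}
    for lang in languages:
        test = load_dataset("xnli", lang, split="test")    # no target-language training data
        correct = sum(predict(model, ex) == ex["label"] for ex in test)
        scores[lang] = correct / len(test)
    return scores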

XTREME-R

Paper: Ruder et al. - "XTREME-R: Towards More Challenging and Nuanced Multilingual Evaluation"
Conference: EMNLP 2021
arXiv: 2104.07412
Languages: 50 typologically diverse languages
Tasks: 10 tasks, including XCOPA for commonsense reasoning and improved retrieval tasks
Dataset: sites.research.google/xtreme
Type: Discriminative + some generative
Description: An improved version of XTREME with more challenging tasks and better language coverage.

XGLUE

Paper: Liang et al. - "XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation"
Conference: EMNLP 2020
arXiv: 2004.01401
Languages: 19 languages
Tasks: 11 tasks covering NLU and NLG: NER, POS, NC, MLQA, XNLI, PAWS-X, QADSM, WPR, QAM, QG, NTG
Dataset: huggingface.co/datasets/microsoft/xglue
Type: Both Discriminative and Generative
Description: Distinguished from XTREME by including both understanding and generation tasks, plus real-world scenarios from Bing.

MEGA

Paper: Ahuja et al. - "MEGA: Multilingual Evaluation of Generative AI"
Conference: EMNLP 2023
arXiv: 2303.12528
Languages: 70 languages
Tasks: 16 tasks including QA (XQuAD, MLQA, TyDiQA, IndicQA), Sequence Labeling (WikiANN NER, UDPOS), NLG (XL-Sum), RAI (Jigsaw, Wino-MT)
Dataset: github.com/microsoft/MEGA
Type: Both
Description: Specifically designed for evaluating generative AI models in multilingual settings.

P-MMEval

Paper: Zhang et al. - "P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs"
Conference: EMNLP 2025
arXiv: 2411.09116
Languages: 10 languages (English, Chinese, Arabic, Spanish, Japanese, Korean, Thai, French, Portuguese, Vietnamese)
Tasks: Code generation, knowledge comprehension, mathematical reasoning, logical reasoning, instruction following
Dataset: huggingface.co/datasets/Qwen/P-MMEval
Type: Both
Description: Ensures consistent language coverage across all tasks with parallel data.

Natural Language Understanding Benchmarks

XNLI

Paper: Conneau et al. - "XNLI: Evaluating Cross-lingual Sentence Representations"
Conference: EMNLP 2018
arXiv: 1809.05053
Languages: 15 languages (French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili, Urdu + English)
Task: Natural Language Inference (entailment, contradiction, neutral)
Dataset: huggingface.co/datasets/facebook/xnli
Type: Discriminative
Description: Crowd-sourced extension of MultiNLI to 14 additional languages. Professional translations ensure quality. Also serves as a 15-way parallel corpus. 5,000 test pairs and 2,500 dev pairs per language (112.5k annotated pairs in total).

PAWS-X

Paper: Yang et al. - "PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification"
Conference: EMNLP 2019
arXiv: 1908.11828
Languages: 7 languages (English, German, Spanish, French, Japanese, Korean, Chinese)
Task: Paraphrase identification (binary classification)
Dataset: huggingface.co/datasets/google-research-datasets/paws-x
Type: Discriminative
Description: Adversarial paraphrase detection requiring models to understand word order and syntax, not just lexical overlap. 49,401 training pairs plus 2,000 dev/test pairs per language.

Belebele

Paper: Bandarkar et al. - "The Belebele Benchmark: A Parallel Reading Comprehension Dataset in 122 Language Variants"
Conference: ACL 2024
arXiv: 2308.16884
Languages: 122 language variants
Task: Multiple-choice machine reading comprehension
Dataset: huggingface.co/datasets/facebook/belebele
Type: Discriminative
Description: Based on FLORES-200, with questions and answers aligned across languages to ensure equivalent difficulty. The most extensive parallel reading comprehension benchmark. 900 questions per language, each with 4 multiple-choice answers.
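
Because the passages, questions, and answer options are fully parallel, evaluating a model on Belebele is the same loop in every language: show the passage and question, have the model pick one of the four options, and count matches against the gold option. A minimal sketch follows; the field names (flores_passage, question, mc_answer1-4, correct_answer_num), the language config code, and the split name are taken from the dataset card and should be treated as assumptions, and choose() is a hypothetical stand-in for your model.

from datasets import load_dataset

def choose(passage, question, options):
    """Hypothetical: return the 1-based index of the option the model prefers."""
    ...

def belebele_accuracy(lang_code="swh_Latn"):
    # Field and split names below follow the dataset card; verify them before use.
    data = load_dataset("facebook/belebele", lang_code, split="test")
    correct = 0
    for ex in data:
        options = [ex[f"mc_answer{i}"] for i in range(1, 5)]
        pred = choose(ex["flores_passage"], ex["question"], options)
        correct += int(pred == int(ex["correct_answer_num"]))
    return correct / len(data)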

Question Answering Benchmarks

MLQA

Paper: Lewis et al. - "MLQA: Evaluating Cross-lingual Extractive Question Answering"
Conference: ACL 2020
arXiv: 1910.07475
Languages: 7 languages (English, Arabic, German, Spanish, Hindi, Vietnamese, Chinese)
Task: Extractive Question Answering
Dataset: huggingface.co/datasets/facebook/mlqa
Type: Generative (span extraction)
Description: Highly parallel dataset allowing cross-lingual QA evaluation. QA instances are parallel between 4 languages on average. 5K+ extractive QA instances per language (12K in English).

XQuAD

Paper: Artetxe et al. - "On the Cross-lingual Transferability of Monolingual Representations"
Conference: ACL 2020
arXiv: 1910.11856
Languages: 11 languages (10 + English)
Task: Extractive Question Answering
Dataset: huggingface.co/datasets/google/xquad
Type: Generative (span extraction)
Description: Professional human translations of a SQuAD v1.1 subset. Good for evaluating zero-shot cross-lingual transfer. 240 paragraphs and 1,190 question-answer pairs.
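
Extractive QA benchmarks such as XQuAD and MLQA are conventionally scored with SQuAD-style exact match and token-level F1. A minimal sketch with the evaluate library is below; get_answer() is a hypothetical placeholder for your QA model, and the config name ("xquad.de") and split follow the dataset card and should be double-checked.

from datasets import load_dataset
import evaluate

squad_metric = evaluate.load("squad")  # exact match + token-level F1

def get_answer(question, context):
    """Hypothetical: return the model's predicted answer span as a string."""
    ...

def xquad_scores(lang="de"):
    data = load_dataset("google/xquad", f"xquad.{lang}", split="validation")
    predictions = [
        {"id": ex["id"], "prediction_text": get_answer(ex["question"], ex["context"])}
        for ex in data
    ]
    references = [{"id": ex["id"], "answers": ex["answers"]} for ex in data]
    return squad_metric.compute(predictions=predictions, references=references)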

TyDi QA

Paper: Clark et al. - "TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages"
Journal: TACL 2020
arXiv: 2003.05002
Languages: 11 typologically diverse languages (Arabic, Bengali, English, Finnish, Indonesian, Japanese, Kiswahili, Korean, Russian, Telugu, Thai)
Task: Information-seeking QA
Dataset: github.com/google-research-datasets/tydiqa
Type: Generative
Description: Questions written by native speakers who do not know the answer, avoiding priming effects. No translation used; data is collected directly in each language. 204K question-answer pairs.

MKQA

Paper: Longpre et al. - "MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering"
Journal: TACL 2021
arXiv: 2007.15207
Languages: 26 languages
Task: Open-domain QA
Dataset: github.com/apple/ml-mkqa
Type: Generative
Description: Evaluates multilingual ODQA systems with parallel questions aligned across languages.

Machine Translation Benchmarks

FLORES-101 / FLORES-200 / FLORES+

Paper: Goyal et al. - "The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation"
Journal: TACL 2022
arXiv: 2106.03193
Languages: 101 languages (FLORES-101), expanded to 200+ (FLORES-200/FLORES+)
Task: Machine Translation
Dataset: huggingface.co/datasets/facebook/flores
Type: Generative
Description: High-quality professional translations covering diverse topics. Enables many-to-many evaluation. The standard benchmark for multilingual MT, especially for low-resource languages. 3,001 sentences drawn from English Wikipedia.
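
FLORES translations are typically scored at the corpus level with BLEU or chrF. Assuming you already have system outputs and the corresponding FLORES reference sentences as two parallel lists of strings, the scoring itself is a few lines with sacrebleu (note that spBLEU, the tokenization used in the FLORES papers, additionally requires a SentencePiece-based tokenizer setting in sacrebleu; check the FLORES documentation for the exact option):

import sacrebleu

def score_translations(hypotheses, references):
    """hypotheses and references are parallel lists of sentence strings."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    chrf = sacrebleu.corpus_chrf(hypotheses, [references])
    return {"BLEU": bleu.score, "chrF": chrf.score}

# Toy example with dummy data:
print(score_translations(["The cat sat on the mat."], ["The cat sat on the mat."]))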

WMT Shared Tasks

Organization: Workshop on Machine Translation (annual)
Languages: Varies by year (typically 15-30 language pairs)
Task: Machine Translation
Website: statmt.org
Type: Generative
Description: Annual shared tasks with human evaluation. The gold standard for MT evaluation.

Tatoeba

Paper: Artetxe & Schwenk - "Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond"
Journal: TACL 2019
arXiv: 1812.10464
Languages: 112+ languages
Task: Sentence retrieval / Translation
Dataset: tatoeba.org
Type: Both
Description: Community-supported parallel sentence collection. Used for evaluating sentence embeddings and translation models. 1,000 sentences per language.

Sequence Labeling Benchmarks

WikiANN / PAN-X

Paper: Pan et al. - "Cross-lingual Name Tagging and Linking for 282 Languages"
Conference: ACL 2017
ACL Anthology: P17-1178
Languages: 282 languages (176 in the processed version)
Task: Named Entity Recognition (PER, LOC, ORG)
Dataset: huggingface.co/datasets/wikiann
Type: Discriminative
Description: Automatically annotated from Wikipedia using cross-lingual links. While broad in coverage, the automatic annotation introduces quality issues for some languages.

Universal NER (UNER)

Paper: Mayhew et al. - "Universal NER: A Gold-Standard Multilingual Named Entity Recognition Benchmark"
Conference: NAACL 2024
arXiv: 2311.09122
Languages: 13+ languages
Task: Named Entity Recognition
Dataset: universalner.org
Type: Discriminative
Description: Gold-standard NER annotations on Universal Dependencies treebanks. Addresses the quality issues of WikiANN with native speaker annotations.

Universal Dependencies (UD)

Paper: Nivre et al. - "Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection"
Conference: LREC 2020
arXiv: 2004.10643
Languages: 104+ languages (v2.7: 183 treebanks)
Tasks: POS Tagging, Morphological Analysis, Dependency Parsing
Dataset: universaldependencies.org
Type: Discriminative
Description: The most comprehensive multilingual treebank collection. Provides cross-linguistically consistent annotation for morphology and syntax.

CoNLL Shared Tasks (NER)

Paper: Sang & De Meulder - "Language-Independent Named Entity Recognition"
Conference: CoNLL at HLT-NAACL 2003
arXiv: cs/0306050
Languages: English, German, Spanish, Dutch
Task: Named Entity Recognition
Dataset: lcg-www.uia.ac.be/conll2003/ner
Type: Discriminative
Description: Classic NER benchmarks still widely used. High-quality human annotations.

Reasoning Benchmarks

XCOPA

Paper: Ponti et al. - "XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning"
Conference: EMNLP 2020
arXiv: 2005.00333
Languages: 11 languages (Estonian, Haitian Creole, Indonesian, Italian, Quechua, Swahili, Tamil, Thai, Turkish, Vietnamese, Chinese)
Task: Causal commonsense reasoning
Dataset: huggingface.co/datasets/cambridgeltl/xcopa
Type: Discriminative
Description: Tests the ability to determine causal relationships. Includes resource-poor languages like Quechua and Haitian Creole. 100 validation + 500 test examples per language.

MGSM

Paper: Shi et al. - "Language Models are Multilingual Chain-of-Thought Reasoners"
Conference: ICLR 2023
arXiv: 2210.03057
Languages: 10 languages (Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, Thai)
Task: Mathematical reasoning (grade-school level)
Dataset: huggingface.co/datasets/juletxara/mgsm
Type: Generative
Description: Manual translations of GSM8K problems. Evaluates chain-of-thought reasoning across languages. 250 problems per language.
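
MGSM is usually scored by letting the model generate a chain of thought and then comparing only the final number against the gold answer. One common convention (not the only one) is to take the last number in the generation, as in this sketch:

import re

def extract_final_number(generation):
    """Take the last integer or decimal in the model output as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return float(numbers[-1]) if numbers else None

def mgsm_accuracy(generations, gold_answers):
    """generations: list of model outputs; gold_answers: list of numeric gold answers."""
    hits = sum(extract_final_number(g) == float(a) for g, a in zip(generations, gold_answers))
    return hits / len(gold_answers)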

XL-WiC

Paper: Raganato et al. - "XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization"
Conference: EMNLP 2020
arXiv: 2010.06478
Languages: 12 languages
Task: Word sense disambiguation
Dataset: huggingface.co/datasets/pasinit/xlwic
Type: Discriminative
Description: Determines whether a word has the same sense in two different sentences across languages.

Text Classification Benchmarks

SIB-200

Paper: Adelani et al. - "SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects"
Conference: EACL 2024
arXiv: 2309.07445
Languages: 205 languages and dialects
Task: Topic classification (7 categories)
Dataset: github.com/dadelani/sib-200
Type: Discriminative
Description: Based on FLORES-200. The first publicly available NLU dataset for many languages. Shows large performance gaps between high-resource and low-resource languages.

Taxi1500

Paper: Ma et al. - "Taxi1500: A Multilingual Dataset for Text Classification in 1500 Languages"
Conference: NAACL 2025
arXiv: 2305.08487
Languages: 1,500+ languages
Task: Topic classification
Dataset: github.com/cisnlp/Taxi1500
Type: Discriminative
Description: Uses the Parallel Bible Corpus. The largest language coverage of any benchmark, but biased toward the religious domain.

MMS Benchmark

Paper: Augustyniak et al. - "Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark"
Conference: NeurIPS 2023 Datasets and Benchmarks
arXiv: 2306.07902
Languages: 27 languages (6 language families)
Task: Sentiment classification
Dataset: github.com/Brand24-AI/mms_benchmark
Type: Discriminative
Description: The most extensive open multilingual sentiment corpus. Datasets are queryable by linguistic and functional features.

XLM-T

Paper: Barbieri et al. - "XLM-T: Multilingual Language Models in Twitter for Sentiment Analysis and Beyond"
Conference: LREC 2022
arXiv: 2104.12250
Languages: 8 languages (Arabic, English, French, German, Hindi, Italian, Portuguese, Spanish)
Task: Twitter sentiment analysis
Dataset: github.com/cardiffnlp/xlm-t
Type: Discriminative
Description: Unified benchmark for cross-lingual Twitter sentiment analysis.

Sentence Retrieval and Similarity

BUCC

Paper: Zweigenbaum et al. - "Spotting Parallel Sentences in Comparable Corpora"
Conference: Workshop on Building and Using Comparable Corpora (BUCC) at ACL 2017
ACL Anthology: W17-2512
Languages: 5 language pairs with English
Task: Parallel sentence identification
Type: Discriminative
Description: Shared task on spotting parallel sentences in comparable corpora.

STS (Semantic Textual Similarity)

Paper: Cer et al. - "Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation"
Conference: SemEval 2017
arXiv: 1708.00055
Languages: Various (SemEval tasks)
Task: Sentence similarity scoring
Type: Discriminative
Description: Various SemEval shared tasks with multilingual extensions.

Knowledge and MMLU-Style Benchmarks

MMLU (Original English)

Paper: Hendrycks et al. - "Measuring Massive Multitask Language Understanding"
Conference: ICLR 2021
Languages: English only (foundation for multilingual versions)
Task: Multiple-choice knowledge questions
Dataset: github.com/hendrycks/test
Type: Discriminative
Description: The standard for evaluating knowledge and reasoning. Foundation for many multilingual adaptations. 15,908 questions across 57 subjects.
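
MMLU-style benchmarks, including the multilingual variants below, share one format: a question, four options, and a single gold choice. A common way to score them is to render the options as lettered choices, ask the model for a letter, and compute plain accuracy. In the sketch below, the field names (question, choices, and answer as a 0-3 index) follow the original English dataset and may differ in some multilingual variants, and ask_model() is a hypothetical stand-in for your model call.

def format_mmlu_prompt(example):
    """Render a question with lettered options; example['choices'] is a list of 4 strings."""
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, example["choices"]))
    return f"{example['question']}\n{options}\nAnswer:"

def mmlu_accuracy(examples, ask_model):
    """ask_model maps a prompt string to the model's reply, expected to start with A-D."""
    letters = ["A", "B", "C", "D"]
    hits = 0
    for ex in examples:
        reply = ask_model(format_mmlu_prompt(ex)).strip().upper()
        hits += int(reply.startswith(letters[ex["answer"]]))
    return hits / len(examples)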

MMMLU (OpenAI)

Languages: 15 languages (English, Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, Simplified Chinese)
Task: Knowledge comprehension
Dataset: huggingface.co/datasets/openai/MMMLU
Type: Discriminative
Description: OpenAI's official multilingual version of MMLU, translated by professional human translators.

MMLU-ProX

Paper: Xuan et al. - "MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation"
Conference: EMNLP 2025
arXiv: 2503.10497
Languages: 29 languages
Task: Advanced reasoning with chain-of-thought
Dataset: mmluprox.github.io
Type: Both
Description: Built on MMLU-Pro with expert-verified translations. Shows up to a 24.3% performance gap between high-resource and low-resource languages. 11,829 questions per language (658 in the lite version).

Global-MMLU

Paper: Singh et al. - "Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation"
Conference: ACL 2025
arXiv: 2412.03304
Languages: 42+ languages
Task: Knowledge comprehension
Dataset: huggingface.co/datasets/CohereLabs/Global-MMLU
Type: Discriminative
Description: Addresses cultural biases in multilingual evaluation.

INCLUDE

Paper: Romanou et al. - "INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge"
Conference: ICLR 2025
arXiv: 2411.19799
Languages: 44 languages
Task: Knowledge and reasoning
Dataset: huggingface.co/datasets/CohereLabs/include-base-44
Type: Discriminative
Description: Constructed from local exam sources in the target languages rather than translations. 197,243 multiple-choice QA pairs.

Speech and Multimodal Benchmarks

FLEURS

Paper: Conneau et al. - "FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech"
Conference: IEEE SLT 2022
arXiv: 2205.12446
Languages: 102 languages
Task: ASR, Speech Translation, Language ID
Dataset: huggingface.co/datasets/google/fleurs
Type: Speech
Description: Speech version of FLORES for evaluating multilingual speech models.
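
For the ASR portion of FLEURS, the standard metric is word error rate (or character error rate for languages written without whitespace). Given reference transcripts and system hypotheses as parallel lists of strings, the jiwer package computes both directly; whether to normalize casing and punctuation first is a per-benchmark decision.

import jiwer

def asr_error_rates(references, hypotheses):
    """references and hypotheses are parallel lists of transcript strings."""
    return {
        "WER": jiwer.wer(references, hypotheses),
        "CER": jiwer.cer(references, hypotheses),
    }

# Toy example:
print(asr_error_rates(["hello world"], ["hello word"]))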

Fleurs-SLU

Paper: Schmidt et al. - "Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding"
Conference: COLM 2025
arXiv: 2501.06117
Languages: 100+ languages
Tasks: Topic classification (102 languages), QA on spoken paragraphs (92 languages)
Dataset: huggingface.co/datasets/WueNLP/sib-fleurs
Type: Speech + NLU
Description: Combines FLEURS with SIB-200 and Belebele for spoken language understanding.

Language-Specific Benchmarks

Many languages now have their own dedicated evaluation suites, modeled after GLUE/SuperGLUE.

🇪🇸🇫🇷 Basque - BasqueGLUE

Conference: LREC 2022
Tasks: NER, Intent Classification, Slot Filling, Topic Classification, Sentiment Analysis, Stance Detection, QA/NLI, WiC, Coreference
Dataset: huggingface.co/datasets/orai-nlp/basqueGLUE

🇧🇬 Bulgarian - bgGLUE

Conference: ACL 2023
Tasks: NER, POS Tagging, Sentiment, Check-Worthiness, Humor Detection, NLI, Multi-Choice QA, Factuality
Dataset: huggingface.co/datasets/bgglue/bgglue

🇩🇪 German - SuperGLEBer

Conference: NAACL 2024
Tasks: NER, Document Classification, STS, QA
Dataset: huggingface.co/datasets/lgienapp/superGLEBer

🇭🇺 Hungarian - HuLU

Conference: LREC 2024
Tasks: CoPA, RTE, SST, WNLI, CommitmentBank, ReCoRD QA
Dataset: huggingface.co/datasets/NYTK/HuCoPA

🇮🇹 Italian - UINAUIL

Conference: ACL 2023 Demo
Tasks: Textual Entailment, Event Detection, Factuality, Sentiment, Irony, Hate Speech
Dataset: huggingface.co/datasets/RiTA-nlp/UINAUIL

🇰🇷 Korean - KMMLU

Conference: NAACL 2025
Tasks: Multi-choice QA across 45 subjects
Dataset: huggingface.co/datasets/HAERAE-HUB/KMMLU

🇳🇴 Norwegian - NorBench

Conference: NAACL 2025
Tasks: POS, Lemmatization, Parsing, NER, Sentiment, Acceptability, QA, MT, Bias Detection
Dataset: github.com/ltgoslo/norbench

🇵🇱 Polish - KLEJ & LEPISZCZE

Conference: ACL 2020
Tasks: NER, Sentence Relatedness, Entailment, Cyberbullying, Sentiment, QA, Paraphrase, Abusive Clauses, Political Ads, NLI, POS, Punctuation, Dialogue Acts
Dataset: klejbenchmark.com/

🇷🇴 Romanian - LiRo

Conference: NeurIPS 2021
Tasks: Classification, NER, MT, Sentiment, POS, Parsing, LM, QA, STS, Debiasing
Dataset: lirobenchmark.github.io

🇷🇺 Russian - MERA

Conference: Dialogue 2025
Tasks: 21 tasks including MathLogicQA, MultiQ, ruMMLU, ruHumanEval, ruEthics
Dataset: mera.a-ai.ru/en

🇸🇪 Swedish - Superlim

Conference: EMNLP 2023
Tasks: Multiple Swedish NLU tasks including SweNLI, SweWiC, SweWinograd
Dataset: huggingface.co/datasets/AI-Sweden/SuperLim

🇻🇳 Vietnamese - ViGLUE

Conference: NAACL 2024
Tasks: MNLI, QNLI, RTE, SST2, MRPC, QQP, CoLA adaptations
Dataset: huggingface.co/datasets/tmnam20/ViGLUE

🇳🇱 Dutch - DUMB

Conference: EMNLP 2023
Tasks: POS, NER, WSD, Pronoun Resolution, Causal Reasoning, NLI, Sentiment, Classification, QA
Dataset: github.com/wietsedv/dumb

🇩🇰 Danish - Semantic Reasoning Benchmark

Conference: LREC 2024
Tasks: Inference, Entailment, Synonymy, Similarity, Relatedness, WiC
Dataset: github.com/kuhumcst/danish-semantic-reasoning-benchmark

🇪🇸 Catalan - The Catalan Language CLUB

Conference: ACL 2021
Tasks: NER, POS, NLI, Classification, QA, STS
Dataset: huggingface.co/BSC-LT

Conclusion

This guide has covered the major multilingual benchmarks available as of late 2025. A few key takeaways:

1. Coverage is improving but uneven: While benchmarks like FLORES-200 and SIB-200 cover 200+ languages, most benchmarks focus on 10-50 languages, with English, Chinese, Spanish, French, and German being most represented.

2. Task diversity: The field has moved beyond simple classification to include reasoning (XCOPA, MGSM), knowledge evaluation (MMLU variants), and complex QA (TyDi QA).

3. The low-resource gap: Performance on low-resource languages consistently lags behind high-resource languages, highlighting the need for continued investment in underrepresented languages.

Last updated: January 2026

Have a benchmark to add? This list is not exhaustive; the multilingual NLP community continues to create new evaluation resources. Consider contributing to this list through GitHub.

References

  1. ^ Wu, M., Wang, W., Liu, S., Yin, H., Wang, X., Zhao, Y., Lyu, C., Wang, L., Luo, W., & Zhang, K. (2025). "The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks." arXiv:2504.15521

Citation


@online{gurgurov2025multilingual,
    title={Multilingual NLP Benchmarks Landscape},
    author={Gurgurov, Daniil},
    year={2026},
    month={January},
    url={https://d-gurgurov.github.io/writing/multilingual-benchmarking.html}
}
                    