Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

Gili Lior1,2  ·  Tzviel Frostig3  ·  Gabriel Stanovsky2  ·  Matan Eyal1
1Google Research    2The Hebrew University of Jerusalem    3PhaseV Trials

Parallel multilingual benchmarks test the same items across many languages, yet today they are scored as if every (item, language) pair were unrelated. Multilingual-IRT explicitly models this parallel structure, and its fitted parameters turn three chronic benchmark problems into three practical tools.

Problem · cost scales with languages

Efficient evaluation

Fit on a fraction of the predictions tensor and impute the rest, with no need to run every LLM on every item in every language.

11–16% lower binary cross-entropy (BCE) than the strongest accuracy baseline; 60% of instances can go unevaluated at negligible loss.

Problem · silent translation errors

Translation-error detection

Per-language difficulty shifts flag items that became unexpectedly harder (or easier) in a given language.

Candidate errors surface across all 28 non-English languages, whereas accuracy baselines concentrate their detections in just three.

Problem · culture vs. general knowledge

Culture-specific items

Splitting discriminability into content vs. language components exposes items whose answers depend on cultural context.

26.1% precision at top-2K vs. 19.8% base rate, surfacing items accuracy methods miss.

Abstract

Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: the cost of exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals.

Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11–16% lower binary cross-entropy than the strongest accuracy-based baseline, surfacing candidate translation errors distributed across all 28 non-English languages (whereas accuracy-based baselines concentrate detections in a few languages), and recovering culture-specific items that accuracy-based baselines miss.

The model in one line

Multilingual-IRT extends the classic 2PL response model with three components, each capturing a structural fact about parallel multilingual benchmarks:

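$$\operatorname{logit}(p_{ijl}) = a_i^{\mathrm{base}}\,\theta_j + a_i^{\mathrm{lang}}\,(\varepsilon_j)_l - (b_i + d_{il})$$
where $i$ indexes items, $j$ LLMs, and $l$ languages, and $\varepsilon_j \sim \mathcal{N}(0, \sigma^2 R_{\mathrm{lng}})$ with a learned cross-lingual correlation matrix $R_{\mathrm{lng}}$.

$d_{il}$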

Per-language difficulty deviation. Translation can make the same item harder in one language than another. A large positive $d_{il}$ is a red flag for a translation error.

$\varepsilon_j$, $R_{\mathrm{lng}}$

Correlated ability residuals. An LLM's proficiency in each language deviates from its overall ability, and these deviations are coupled across languages through a learned correlation matrix.

$a_i^{\mathrm{base}}$ / $a_i^{\mathrm{lang}}$

Split discriminability. How much an item separates LLMs by content, versus by language. A high ratio $a_i^{\mathrm{lang}} / a_i^{\mathrm{base}}$ signals culture-specific knowledge.
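The response model above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names and all parameter values are made up for the example, and the language-specific residual $(\varepsilon_j)_l$ is passed in as a plain number rather than sampled from its correlated prior.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def response_prob(a_base: float, a_lang: float, theta: float,
                  eps_l: float, b: float, d_il: float) -> float:
    """P(LLM j answers item i correctly in language l).

    logit = a_base * theta + a_lang * eps_l - (b + d_il)
    """
    logit = a_base * theta + a_lang * eps_l - (b + d_il)
    return sigmoid(logit)


# Illustrative (hypothetical) values, not fitted parameters.
p_plain = response_prob(a_base=1.2, a_lang=0.4, theta=0.8,
                        eps_l=0.0, b=0.3, d_il=0.0)
# Same item and LLM, but the translation made the item harder (d_il > 0).
p_shift = response_prob(a_base=1.2, a_lang=0.4, theta=0.8,
                        eps_l=0.0, b=0.3, d_il=1.5)
print(f"{p_plain:.3f} -> {p_shift:.3f}")
```

Setting `a_lang = 0` and `d_il = 0` reduces the logit to `a_base * theta - b`, the slope-intercept form of the 2PL model, which is the sense in which the three components are extensions.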

$b_i$, $a_i$ · shared item parameters across languages

Evaluate a fraction, predict the rest

How. Evaluating 25 LLMs on 11,759 items in 29 languages means filling an ~8.5M-entry predictions tensor. Multilingual-IRT fits on a sampled fraction ρ and predicts the rest, and the cost of fitting is negligible compared to LLM inference.

Finding. At every observation fraction, Multilingual-IRT beats all accuracy-based baselines, and reaches near-full accuracy with only 40% of instances observed.
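The evaluation protocol can be sketched as follows: mark a fraction ρ of tensor entries as observed, fit on those, and score predictions on the rest with binary cross-entropy. This is only a sketch under assumptions: uniform random sampling of entries is assumed here (the page does not say how the fraction is drawn), and `split_observed` and `binary_cross_entropy` are hypothetical helper names.

```python
import math
import random

# Size of the full predictions tensor quoted above.
N_LLMS, N_ITEMS, N_LANGS = 25, 11_759, 29
TOTAL = N_LLMS * N_ITEMS * N_LANGS  # 8,525,275 entries, i.e. ~8.5M


def split_observed(n_entries: int, rho: float, seed: int = 0):
    """Mark a fraction rho of flat tensor indices as observed.

    Assumption: entries are sampled uniformly at random.
    """
    rng = random.Random(seed)
    n_obs = round(rho * n_entries)
    observed = set(rng.sample(range(n_entries), n_obs))
    held_out = [k for k in range(n_entries) if k not in observed]
    return observed, held_out


def binary_cross_entropy(y_true, p_pred, tiny: float = 1e-12) -> float:
    """Mean BCE between 0/1 outcomes and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, tiny), 1.0 - tiny)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


# Small toy tensor: observe 40% of 1,000 entries, hold out the rest.
observed, held_out = split_observed(1_000, rho=0.4)
print(TOTAL, len(observed), len(held_out))
```

A model would be fitted on the `observed` indices only, and `binary_cross_entropy` would then be computed over the `held_out` ones.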

Line charts of binary cross-entropy and ROC-AUC on unseen instances as a function of observed fraction, with Multilingual-IRT outperforming all baselines.
Prediction over unseen (item, LLM, language) instances. (a) Binary cross-entropy and (b) ROC-AUC versus the fraction of observed instances ρ. Multilingual-IRT (red) achieves 11–16% lower BCE than the strongest accuracy-based baseline at every ρ.

$d_{il}$ · per-language difficulty deviation

Find diverse translation errors

How. Ranking instances by the standardized difficulty shift $s(i,l) = d_{il} / \mathrm{SE}(d_{il})$ surfaces candidate translation errors for targeted review. An LLM-judge audit (validated by human annotators) confirms the flagged items.

Finding. Accuracy baselines concentrate ≥70% of detections in three African languages; Multilingual-IRT spreads detections across all 28 non-English languages, and its negative tail even catches untranslated English left in questions.
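The ranking step described above amounts to dividing each fitted deviation by its standard error and sorting. The sketch below assumes the deviations and standard errors are already available as dictionaries keyed by (item, language); the keys, values, and helper names are invented for illustration.

```python
def standardized_shifts(d: dict, se: dict) -> dict:
    """s(i, l) = d_il / SE(d_il) for every (item, language) key."""
    return {key: d[key] / se[key] for key in d}


def top_k(scores: dict, k: int, tail: str = "positive") -> list:
    """Return the k keys with the largest (or most negative) scores."""
    reverse = tail == "positive"
    return sorted(scores, key=scores.get, reverse=reverse)[:k]


# Toy (hypothetical) fitted deviations and their standard errors.
d = {("q1", "yo"): 1.8, ("q1", "fr"): 0.1,
     ("q2", "zu"): -1.2, ("q2", "de"): 0.9}
se = {("q1", "yo"): 0.3, ("q1", "fr"): 0.2,
      ("q2", "zu"): 0.4, ("q2", "de"): 0.9}

s = standardized_shifts(d, se)
flagged = top_k(s, 2)                          # unexpectedly harder items
negative_tail = top_k(s, 1, tail="negative")   # unexpectedly easier items
print(flagged, negative_tail)
```

Standardizing matters in this toy data: `("q2", "de")` has a sizeable raw shift of 0.9 but a large standard error, so it scores far below `("q1", "yo")`.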

Jaccard overlap matrix between error-detection methods and stacked bars showing per-language distribution of detected translation errors.
Top-1000 critical translation errors by seven scoring methods. (a) Pairwise Jaccard overlap shows Multilingual-IRT's detections are largely complementary to accuracy-based ones. (b) Language distribution: accuracy methods pile onto Wolof, Yoruba, and Zulu, while Multilingual-IRT's detections are spread nearly uniformly, with no language above 11%.

$a_i^{\mathrm{lang}} / a_i^{\mathrm{base}}$ · split discriminability ratio

Surface culture-specific items

How. Ranking items by the ratio $a_i^{\mathrm{lang}} / a_i^{\mathrm{base}}$ surfaces those whose separation between LLMs is driven by language rather than content.

Finding. The top-ranked items probe culture-specific knowledge (U.S. federal subpoenas, Sikh Ragas, Yiddish vocabulary), barely overlap with what accuracy-based rankers find (Jaccard ≤ 0.11), and are systematically harder than universal ones.
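The quantities used in this section (the ratio ranking, precision at top-K, and Jaccard overlap between two rankers) are simple to compute once item parameters are fitted. The following is a sketch with invented item IDs and values; it assumes every $a_i^{\mathrm{base}}$ is strictly positive so the ratio is well defined.

```python
def rank_by_ratio(a_lang: dict, a_base: dict) -> list:
    """Item IDs sorted by a_lang / a_base, largest ratio first.

    Assumption: every a_base value is strictly positive.
    """
    ratios = {i: a_lang[i] / a_base[i] for i in a_lang}
    return sorted(ratios, key=ratios.get, reverse=True)


def precision_at_k(ranking: list, positives: set, k: int) -> float:
    """Share of the top-k ranked items that carry a positive label."""
    return sum(1 for i in ranking[:k] if i in positives) / k


def jaccard(a, b) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two item collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy (hypothetical) discriminability parameters for four items.
a_lang = {"i1": 0.9, "i2": 0.1, "i3": 0.5, "i4": 0.2}
a_base = {"i1": 0.6, "i2": 1.0, "i3": 1.0, "i4": 0.8}

ranking = rank_by_ratio(a_lang, a_base)
cultural = {"i1", "i4"}  # stand-in for judge-assigned labels
print(ranking, precision_at_k(ranking, cultural, k=2))
print(jaccard(ranking[:2], ["i3", "i2"]))
```

Comparing `precision_at_k` against the overall share of positive labels gives the "precision vs. base rate" comparison reported on this page, and `jaccard` on two top-K lists gives the overlap between rankers.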

Density plot of item difficulty for cultural versus non-cultural items, with cultural items shifted toward higher difficulty.
Cultural items are harder. Distribution of item difficulty $b_i$ split by the judge's cultural-specificity label: mean difficulty +1.32 for cultural vs. +0.51 for non-cultural items (p < 10⁻⁴), with a small native-language advantage on top.

$R_{\mathrm{lng}}$ · learned cross-lingual correlation matrix

Bonus: recover linguistic structure

Finding. The learned correlation matrix recovers Romance, Slavic, South Asian, and African family blocks with no typological supervision, and shows English clustering with high-resource Arabic, Japanese, and Chinese rather than with its Germanic relatives.

Learned cross-lingual correlation matrix with visible diagonal blocks corresponding to language families.
Fitted language correlation matrix $R_{\mathrm{lng}}$. Colored boxes group languages by linguistic family or region; no group labels are supplied at fit time.

Citation

@article{lior2026extending,
  title={Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation},
  author={Lior, Gili and Frostig, Tzviel and Stanovsky, Gabriel and Eyal, Matan},
  journal={arXiv preprint arXiv:2606.15643},
  year={2026}
}