Multilingual-IRT: Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation

Parallel multilingual benchmarks test the same items across many languages, yet today they are scored as if every (item, language) pair were unrelated. Multilingual-IRT models that parallel structure directly, and its fitted parameters turn three chronic benchmark problems into three practical tools.

Problem · cost scales with languages

Efficient evaluation

Fit on a fraction of the predictions tensor and impute the rest, with no need to run every LLM on every item in every language.

11–16% lower binary cross-entropy (BCE) than the strongest accuracy-based baseline; 60% of instances can go unevaluated at negligible loss.

Problem · silent translation errors

Translation-error detection

Per-language difficulty shifts flag items that became unexpectedly harder (or easier) in a given language.

Errors surfaced across all 28 non-English languages, whereas accuracy-based baselines pile up in just three.

Problem · culture vs. general knowledge

Culture-specific items

Splitting discriminability into content vs. language components exposes items whose answers depend on cultural context.

26.1% precision at top-2K vs. 19.8% base rate, surfacing items accuracy methods miss.

Abstract

Multilingual benchmarks are central to evaluating large language models (LLMs) across languages, but they suffer from three issues: the cost of exhaustive evaluation scales linearly with the number of languages, automatic translation introduces errors that are easily missed at scale, and some items conflate general and culture-specific knowledge. We address all three with a unified statistical framework, Multilingual-IRT, which extends Item Response Theory with per-language difficulty deviations, split discriminability separating content from language effects, and per-language ability residuals.

Fitting Multilingual-IRT on 25 LLMs across 29 languages of MMLU-Pro-X, we show that its fitted parameters support three practical applications: predicting unobserved (item, LLM, language) instances with 11–16% lower binary cross-entropy than the strongest accuracy-based baseline, surfacing candidate translation errors distributed across all 28 non-English languages (whereas accuracy-based baselines concentrate detections in a few languages), and recovering culture-specific items that accuracy-based baselines miss.

The model in one line

Multilingual-IRT extends the classic 2PL response model with three components, each capturing a structural fact about parallel multilingual benchmarks:

logit(p_ijl) = a^base_i θ_j + a^lang_i (ε_j)_l − (b_i + d_il)

item i · LLM j · language l · ε_j ~ N(0, σ²R^lng) with a learned cross-lingual correlation matrix

d_il

Per-language difficulty deviation. Translation can make the same item harder in one language than another. A large positive d_il is a red flag for a translation error.

ε_j , R^lng

Correlated ability residuals. An LLM's proficiency in each language deviates from its overall ability, and these deviations are coupled across languages through a learned correlation matrix.

a^base_i / a^lang_i

Split discriminability. How much an item separates LLMs by content versus by language. A high ratio a^lang_i/a^base_i signals culture-specific knowledge.
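To make the response model concrete, here is a minimal Python sketch that evaluates the logit above for a single (item, LLM, language) instance. The function name and all parameter values are hypothetical and chosen only for illustration; they are not fitted values from the paper.

```python
import math


def response_prob(a_base, a_lang, theta, eps_l, b, d_il):
    """Probability of a correct answer under the stated logit:
    a_base * theta + a_lang * eps_l - (b + d_il)."""
    logit = a_base * theta + a_lang * eps_l - (b + d_il)
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical values: the same item and LLM, with and without a
# positive per-language difficulty deviation d_il.
p_no_shift = response_prob(a_base=1.2, a_lang=0.4, theta=0.8,
                           eps_l=-0.3, b=0.5, d_il=0.0)
p_shifted = response_prob(a_base=1.2, a_lang=0.4, theta=0.8,
                          eps_l=-0.3, b=0.5, d_il=1.5)
print(f"d_il = 0.0 -> p = {p_no_shift:.3f}")
print(f"d_il = 1.5 -> p = {p_shifted:.3f}")
```

Because d_il enters the logit with a negative sign, a positive deviation lowers the predicted probability of a correct answer in that language while the shared item parameters stay fixed.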

b_i, a_i shared item parameters across languages

Evaluate a fraction, predict the rest

How · Evaluating 25 LLMs on 11,759 items in 29 languages means filling an ~8.5M-entry predictions tensor. Multilingual-IRT fits on a sampled fraction ρ of that tensor and predicts the rest; the cost of fitting is negligible compared with LLM inference.

Finding · At every observation fraction, Multilingual-IRT beats all accuracy-based baselines, and reaches near-full accuracy with only 40% of instances observed.

Line charts of binary cross-entropy and ROC-AUC on unseen instances as a function of observed fraction, with Multilingual-IRT outperforming all baselines. — **Prediction over unseen (item, LLM, language) instances.** (a) Binary cross-entropy and (b) ROC-AUC versus the fraction of observed instances ρ. Multilingual-IRT (red) achieves 11–16% lower BCE than the strongest accuracy-based baseline at every ρ.
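The evaluation setup can be sketched in a few lines of Python: count the entries of the predictions tensor, sample an observed fraction ρ, and score predictions on the held-out remainder with binary cross-entropy. The helper names, the uniform random sampling, and the toy sizes are assumptions for illustration; the paper's actual sampling scheme may differ.

```python
import math
import random

# Size of the full predictions tensor quoted in the text.
n_llms, n_items, n_langs = 25, 11_759, 29
n_entries = n_llms * n_items * n_langs
print(f"tensor entries: {n_entries:,}")


def split_observed(n_instances, rho, seed=0):
    """Uniformly sample a fraction rho as observed; the rest is held out."""
    rng = random.Random(seed)
    idx = list(range(n_instances))
    rng.shuffle(idx)
    cut = round(rho * n_instances)
    return idx[:cut], idx[cut:]


def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean BCE between 0/1 outcomes and predicted probabilities."""
    loss = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss / len(y_true)


# Toy example: observe 40% of 1,000 instances, hold out the other 60%.
observed, held_out = split_observed(1_000, rho=0.4)
print(len(observed), len(held_out))
```

In a full pipeline the model would be fitted on the observed indices only, and `binary_cross_entropy` would be computed on the held-out ones.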

d_il per-language difficulty deviation

Find diverse translation errors

How · Ranking instances by the standardized difficulty shift s(i,l) = d_il/SE(d_il) surfaces candidate translation errors for targeted review. An LLM-judge audit (validated by human annotators) confirms the flagged items.

Finding · Accuracy baselines concentrate ≥70% of detections in three African languages; Multilingual-IRT spreads detections across all 28 non-English languages, and its negative tail even catches untranslated English left in questions.

Jaccard overlap matrix between error-detection methods and stacked bars showing per-language distribution of detected translation errors. — **Top-1000 critical translation errors by seven scoring methods.** (a) Pairwise Jaccard overlap shows Multilingual-IRT's detections are largely complementary to accuracy-based ones. (b) Language distribution: accuracy methods pile onto Wolof, Yoruba, and Zulu, while Multilingual-IRT's detections are spread nearly uniformly, with no language above 11%.
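As a rough illustration of the ranking step, the Python sketch below computes s(i,l) = d_il/SE(d_il), takes the top of the positive tail and of the negative tail, and measures the Jaccard overlap between two detection sets. The item IDs, language codes, and numeric values are hypothetical placeholders, not results from the paper.

```python
def standardized_shifts(d, se):
    """s(i, l) = d_il / SE(d_il) for every (item, language) key."""
    return {key: d[key] / se[key] for key in d}


def top_k(scores, k, largest=True):
    """Keys of the k largest (or smallest) scores."""
    ranked = sorted(scores, key=scores.get, reverse=largest)
    return ranked[:k]


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two detection sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical fitted deviations and standard errors.
d = {("q1", "wo"): 2.0, ("q1", "de"): 0.1,
     ("q2", "ja"): 1.5, ("q3", "fr"): -1.8}
se = {("q1", "wo"): 0.4, ("q1", "de"): 0.5,
      ("q2", "ja"): 1.5, ("q3", "fr"): 0.3}

s = standardized_shifts(d, se)
harder = top_k(s, 1)                 # positive tail: unexpectedly harder
easier = top_k(s, 1, largest=False)  # negative tail: unexpectedly easier
print(harder, easier)
```

Dividing by the standard error means a large but noisy deviation (here the `("q2", "ja")` entry) ranks below a smaller, well-estimated one.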

a^lang_i / a^base_i split discriminability ratio

Surface culture-specific items

How · Ranking items by the ratio a^lang_i/a^base_i surfaces those whose separation between LLMs is driven by language rather than content.

Finding · The top-ranked items probe culture-specific knowledge (U.S. federal subpoenas, Sikh Ragas, Yiddish vocabulary), barely overlap with what accuracy-based rankers find (Jaccard ≤ 0.11), and are systematically harder than universal ones.

Density plot of item difficulty for cultural versus non-cultural items, with cultural items shifted toward higher difficulty. — **Cultural items are harder.** Distribution of item difficulty b_i split by the judge's cultural-specificity label: mean difficulty +1.32 for cultural vs. +0.51 for non-cultural items (p < 10⁻⁴), with a small native-language advantage on top.
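The ratio ranking and its evaluation can be sketched as follows: sort items by a^lang_i/a^base_i, then compare precision among the top-k items with the base rate of cultural items overall. The item IDs, parameter values, and labels are made up for illustration, and the sketch assumes positive a^base_i (a small floor guards against division by zero).

```python
def rank_by_ratio(a_lang, a_base, floor=1e-8):
    """Item IDs sorted by a_lang / a_base, highest ratio first."""
    ratios = {i: a_lang[i] / max(a_base[i], floor) for i in a_lang}
    return sorted(ratios, key=ratios.get, reverse=True)


def precision_at_k(ranked, is_cultural, k):
    """Fraction of the top-k ranked items labeled culture-specific."""
    top = ranked[:k]
    return sum(is_cultural[i] for i in top) / len(top)


# Hypothetical discriminability parameters and judge labels.
a_lang = {"i1": 0.9, "i2": 0.1, "i3": 0.6, "i4": 0.2}
a_base = {"i1": 0.3, "i2": 1.0, "i3": 0.4, "i4": 0.8}
is_cultural = {"i1": True, "i2": False, "i3": True, "i4": False}

ranked = rank_by_ratio(a_lang, a_base)
base_rate = sum(is_cultural.values()) / len(is_cultural)
print(ranked, precision_at_k(ranked, is_cultural, k=2), base_rate)
```

A ranker is informative when its precision at top-k exceeds the base rate, which is the comparison the 26.1% vs. 19.8% figures report.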

R^lng learned cross-lingual correlation matrix

Bonus: recover linguistic structure

Finding · The learned correlation matrix recovers Romance, Slavic, South Asian, and African family blocks with no typological supervision, and shows English clustering with high-resource Arabic, Japanese, and Chinese rather than with its Germanic relatives.

Learned cross-lingual correlation matrix with visible diagonal blocks corresponding to language families. — **Fitted language correlation matrix R^lng.** Colored boxes group languages by linguistic family or region; no group labels are supplied at fit time.
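One simple way to inspect such a matrix, offered here only as an illustrative sketch and not as the paper's analysis, is to list each language's most strongly correlated neighbors. The language codes and correlation values below are hypothetical placeholders, not fitted values of R^lng.

```python
def nearest_neighbors(langs, R, k=1):
    """For each language, the k other languages with highest correlation."""
    neighbors = {}
    for i, lang in enumerate(langs):
        others = [(R[i][j], langs[j]) for j in range(len(langs)) if j != i]
        others.sort(reverse=True)  # highest correlation first
        neighbors[lang] = [name for _, name in others[:k]]
    return neighbors


# Hypothetical 4x4 correlation matrix with two visible blocks.
langs = ["es", "pt", "ru", "uk"]
R = [
    [1.00, 0.80, 0.30, 0.20],
    [0.80, 1.00, 0.25, 0.30],
    [0.30, 0.25, 1.00, 0.70],
    [0.20, 0.30, 0.70, 1.00],
]

nn = nearest_neighbors(langs, R)
for lang in langs:
    print(lang, "->", nn[lang])
```

When family blocks are present, each language's nearest neighbor tends to fall inside its own block, which is what the diagonal blocks in the figure show visually.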

Citation

@article{lior2026extending,
  title={Extending Item Response Theory for Efficient and Meaningful Multilingual Evaluation},
  author={Lior, Gili and Frostig, Tzviel and Stanovsky, Gabriel and Eyal, Matan},
  journal={arXiv preprint arXiv:2606.15643},
  year={2026}
}