Evals by Welo Data

Previously known as Model Assessment Suite

Evals by Welo Data measures how frontier and open-source LLMs perform across languages — because the real world is multilingual, and model performance varies significantly across languages and tasks. Our evaluations span 8 languages (Arabic, German, English, Spanish, French, Japanese, Korean, and Turkish) and 4 domains (Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General).

Our evaluations focus on multilingual reasoning. Despite significant advances in LLMs, multilingual reasoning remains a difficult challenge. We've evaluated 51 models using human-validated, natively-authored scenarios and evaluation items designed to surface genuine reasoning capabilities — not pattern-matching on seen data.

8 Languages · 4 Domains
AR · DE · EN · ES · FR · JA · KO · TR
Evaluation Domain: Multilingual Reasoning
See how your model performs →
Overall Rankings
Overall Leaderboard
Model · Accuracy & Confidence Interval
gpt-5 · 69.8% ±0.7%
gpt-5.4 · 69.3% ±0.7%
o1-preview-2024-09-12 · 68.0% ±0.7%
gpt-4.5-preview · 67.6% ±0.7%
gemini-3.1-pro-preview · 66.4% ±0.7%
gemini-2.5-pro-preview-05-06 · 66.3% ±0.7%
claude-sonnet-4-6 · 65.7% ±0.7%
gpt-4-0125-preview · 65.7% ±0.7%
gpt-4o · 63.8% ±0.8%
mistral-large@2407 · 62.8% ±0.8%
mistral-large-2411@001 · 62.7% ±0.8%
grok-4-1-fast-reasoning · 62.5% ±0.8%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.9% ±0.8%
deepseek-r1 · 60.5% ±0.8%
grok-4-0709 · 60.5% ±0.8%
o1-mini · 60.3% ±0.8%
gpt-4o-mini · 59.2% ±0.8%
palmyra-x-004 · 59.2% ±0.8%
grok-beta · 59.2% ±0.8%
claude-3-7-sonnet@20250219 · 59.0% ±0.8%
phi-4 · 58.8% ±0.8%
amazon.nova-pro-v1:0 · 58.7% ±0.8%
llama-4-maverick-17b-128e-instruct-maas · 58.2% ±0.8%
claude-3-5-sonnet-v2@20241022 · 57.8% ±0.8%
llama-3.1-405b-instruct-maas · 57.6% ±0.8%
palmyra-x-003-instruct · 57.1% ±0.8%
claude-3-opus@20240229 · 56.7% ±0.8%
gemini-1.5-pro · 56.4% ±0.8%
palmyra-fin · 56.3% ±0.8%
mistral-small-2503@001 · 56.2% ±0.8%
gemini-2.0-flash-exp · 56.0% ±0.8%
llama-3.1-70b-instruct-maas · 55.9% ±0.8%
llama-3.2-90b-vision-instruct-maas · 55.7% ±0.8%
gpt-4 · 52.6% ±0.8%
claude-3-5-haiku@20241022 · 52.4% ±0.8%
gemini-1.5-flash · 51.3% ±0.8%
Qwen2.5-72b-instruct · 50.9% ±0.8%
claude-3-sonnet@20240229 · 50.4% ±0.8%
amazon.nova-lite-v1:0 · 49.7% ±0.8%
mistral-nemo-2407 · 47.5% ±0.8%
claude-3-haiku@20240307 · 47.4% ±0.8%
gemini-1.0-pro · 45.9% ±0.8%
gpt-3.5-turbo-0125 · 42.9% ±0.8%
mistral-large-2402-v1 · 40.5% ±0.8%
jamba-1.5-large@001 · 40.1% ±0.8%
llama-3.1-8b-instruct-maas · 39.2% ±0.8%
cohere.command-r-plus-v1:0 · 36.1% ±0.8%
gpt-3.5-turbo-instruct · 31.1% ±0.7%
phi-3.5-mini-instruct · 27.3% ±0.7%
palmyra-med · 23.5% ±0.7%
codellama-34b-instruct-hf · 18.0% ±0.6%
By Language
Language Leaderboards
Model · Accuracy & CI
gpt-4.5-preview · 68.4% ±2.1%
gpt-5 · 65.9% ±2.1%
gemini-3.1-pro-preview · 65.8% ±2.1%
o1-preview-2024-09-12 · 64.9% ±2.1%
gpt-5.4 · 64.4% ±2.1%
gpt-4-0125-preview · 63.6% ±2.1%
gpt-4o · 63.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 62.8% ±2.1%
claude-sonnet-4-6 · 60.3% ±2.2%
grok-4-1-fast-reasoning · 60.3% ±2.2%
qwen.qwen3-235b-a22b-2507-v1:0 · 59.8% ±2.2%
codellama-34b-instruct-hf · 0.2% ±0.2%
Model · Accuracy & CI
gpt-5 · 73.4% ±2.0%
gpt-5.4 · 71.9% ±2.0%
o1-preview-2024-09-12 · 70.0% ±2.0%
claude-sonnet-4-6 · 68.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.1% ±2.1%
mistral-large@2407 · 67.5% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
gpt-4.5-preview · 67.4% ±2.1%
gpt-4-0125-preview · 66.2% ±2.1%
mistral-large-2411@001 · 65.9% ±2.1%
codellama-34b-instruct-hf · 11.7% ±1.4%
Model · Accuracy & CI
gpt-5.4 · 73.5% ±2.0%
gpt-5 · 71.5% ±2.0%
o1-preview-2024-09-12 · 71.0% ±2.0%
gpt-4.5-preview · 70.9% ±2.0%
claude-sonnet-4-6 · 67.1% ±2.1%
gpt-4o · 66.6% ±2.1%
gpt-4-0125-preview · 66.6% ±2.1%
grok-4-1-fast-reasoning · 65.3% ±2.1%
phi-4 · 65.1% ±2.1%
palmyra-x-004 · 65.1% ±2.1%
palmyra-med · 30.5% ±2.0%
Model · Accuracy & CI
gpt-5 · 72.5% ±2.0%
gpt-4.5-preview · 71.6% ±2.0%
o1-preview-2024-09-12 · 71.5% ±2.0%
gpt-5.4 · 70.5% ±2.0%
claude-sonnet-4-6 · 70.0% ±2.0%
gemini-3.1-pro-preview · 68.7% ±2.1%
gpt-4-0125-preview · 68.2% ±2.1%
gpt-4o · 67.5% ±2.1%
grok-4-1-fast-reasoning · 67.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.2% ±2.1%
jamba-1.5-large@001 · 21.9% ±1.8%
Model · Accuracy & CI
gpt-5 · 70.6% ±2.0%
gpt-5.4 · 68.8% ±2.0%
gpt-4-0125-preview · 68.3% ±2.1%
gpt-4.5-preview · 67.8% ±2.1%
mistral-large@2407 · 66.6% ±2.1%
gpt-4o · 66.5% ±2.1%
mistral-large-2411@001 · 65.7% ±2.1%
o1-preview-2024-09-12 · 65.6% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gemini-3.1-pro-preview · 64.0% ±2.1%
palmyra-med · 28.8% ±2.0%
Model · Accuracy & CI
gpt-5.4 · 68.9% ±2.0%
gemini-2.5-pro-preview-05-06 · 68.4% ±2.1%
gemini-3.1-pro-preview · 67.8% ±2.1%
gpt-5 · 67.0% ±2.1%
o1-preview-2024-09-12 · 66.9% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gpt-4.5-preview · 63.8% ±2.1%
mistral-large@2407 · 63.5% ±2.1%
mistral-large-2411@001 · 63.5% ±2.1%
gpt-4-0125-preview · 63.3% ±2.1%
codellama-34b-instruct-hf · 13.3% ±1.5%
Model · Accuracy & CI
gpt-5.4 · 69.2% ±2.0%
gpt-5 · 68.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.3% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
claude-sonnet-4-6 · 66.3% ±2.1%
o1-preview-2024-09-12 · 65.3% ±2.1%
gpt-4-0125-preview · 64.5% ±2.1%
gpt-4.5-preview · 62.7% ±2.1%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.4% ±2.2%
mistral-large@2407 · 60.5% ±2.2%
codellama-34b-instruct-hf · 3.2% ±0.8%
Model · Accuracy & CI
gpt-5 · 69.3% ±2.0%
o1-preview-2024-09-12 · 69.0% ±2.0%
gpt-4.5-preview · 68.0% ±2.1%
gpt-5.4 · 67.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.4% ±2.1%
gemini-3.1-pro-preview · 67.1% ±2.1%
claude-sonnet-4-6 · 65.3% ±2.1%
gpt-4-0125-preview · 64.8% ±2.1%
grok-4-1-fast-reasoning · 62.4% ±2.1%
gpt-4o · 61.5% ±2.2%
codellama-34b-instruct-hf · 7.6% ±1.2%
By Domain
Domain Leaderboards
Model · Accuracy & CI
gpt-5.4 · 83.4% ±1.3%
gemini-3.1-pro-preview · 83.1% ±1.3%
claude-sonnet-4-6 · 83.1% ±1.3%
gpt-5 · 82.0% ±1.3%
grok-4-1-fast-reasoning · 81.4% ±1.4%
gemini-2.5-pro-preview-05-06 · 80.9% ±1.4%
gpt-4-0125-preview · 80.3% ±1.4%
o1-preview-2024-09-12 · 79.8% ±1.4%
gpt-4.5-preview · 78.4% ±1.4%
grok-4-0709 · 77.6% ±1.5%
codellama-34b-instruct-hf · 17.6% ±1.3%
Model · Accuracy & CI
gpt-5 · 76.5% ±1.2%
o1-preview-2024-09-12 · 72.3% ±1.3%
gemini-2.5-pro-preview-05-06 · 71.8% ±1.3%
gemini-3.1-pro-preview · 70.9% ±1.3%
gpt-4-0125-preview · 70.9% ±1.3%
gpt-5.4 · 70.7% ±1.3%
gpt-4.5-preview · 70.2% ±1.3%
deepseek-r1 · 67.7% ±1.3%
o1-mini · 67.3% ±1.3%
claude-sonnet-4-6 · 67.1% ±1.3%
codellama-34b-instruct-hf · 16.9% ±1.1%
Model · Accuracy & CI
gpt-5.4 · 76.0% ±1.3%
gpt-4.5-preview · 75.0% ±1.3%
o1-preview-2024-09-12 · 73.2% ±1.4%
gpt-4-0125-preview · 71.9% ±1.4%
gemini-3.1-pro-preview · 71.7% ±1.4%
claude-sonnet-4-6 · 71.2% ±1.4%
gpt-5 · 70.5% ±1.4%
gpt-4o · 69.9% ±1.4%
gemini-2.5-pro-preview-05-06 · 69.7% ±1.4%
mistral-large-2411@001 · 69.4% ±1.4%
phi-3.5-mini-instruct · 19.2% ±1.2%
How It Works
Methodology
01 — DATASET DESIGN
Novel, Human-Authored Scenarios
The dataset consists of three components: fact-based scenarios, scenario-based narratives, and question & answer pairs. Domain experts generated novel, fact-based scenarios using terminology from their respective fields. Writers then used those scenarios to produce narratives from different character perspectives. Finally, experts in Cognitive Science, Philosophy, Linguistics, and NLP research generated Q&A pairs based on the causal events in each scenario. Because the stories and questions are entirely original, models cannot rely on memorized training data — their multilingual reasoning capabilities are genuinely tested.
02 — SCENARIO TYPES
Three Causation Categories
Each domain contains six scenarios divided across three types. Standard Causation depicts clear cause-and-effect relationships without norm violations. Normality Violation — Explicit introduces scenarios where at least one explicit norm is violated (a policy, rule, law, or regulation). Normality Violation — Implicit involves violations of informal, unwritten rules such as social norms or everyday conventions. Each scenario includes 9–14 questions depending on type.
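The scenario structure described above can be sketched as a simple schema. This is an illustrative model only; the actual dataset format is not published, and all field names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class CausationType(Enum):
    # The three causation categories described in the methodology.
    STANDARD = "standard_causation"
    EXPLICIT_VIOLATION = "normality_violation_explicit"
    IMPLICIT_VIOLATION = "normality_violation_implicit"

@dataclass
class Question:
    text: str
    options: list[str]  # binary or multiple-choice answer options
    answer: str         # gold label

@dataclass
class Scenario:
    domain: str                    # e.g. "Legal & Criminal Justice"
    causation_type: CausationType
    narrative: str                 # story written from a character's perspective
    questions: list[Question]

    def __post_init__(self):
        # Each scenario carries 9-14 questions depending on its type.
        assert 9 <= len(self.questions) <= 14, "expected 9-14 questions"
```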
03 — QUESTION CATEGORIES
What the Evaluation Measures
Models are evaluated on their ability to identify causal relationships, distinguish a cause from a confounder, determine normality violations in a chain of causal events, and perform these tasks in the context of language variation. Question types include binary and multiple-choice formats across four categories: Causal Discovery (Cause), Causal Discovery (Confounder), Language Variation, and Normality Variation.
04 — TRANSLATION APPROACH
Cross-Lingual Consistency
Stories and questions were originally written in English and professionally translated across seven additional languages. Translators were instructed to avoid word-for-word translation and instead prioritize retaining original semantics while using natural word choices and grammatical structures appropriate to each language. This standardization enables comparison of the same model on the same scenario across languages, while preserving linguistic and cultural naturalness.
05 — DOMAINS
Four Subject Areas
The dataset covers four domains: Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General. Each domain includes scenarios across all three causation types, ensuring that results reflect reasoning capability rather than domain familiarity alone.
06 — STATISTICS
Confidence Intervals
All accuracy scores include approximate 95% confidence intervals. The ± values shown in each leaderboard represent the margin of error around each accuracy estimate, giving a range rather than just a point estimate. Models with overlapping confidence intervals should be considered statistically tied.
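A margin of this form can be computed with the standard normal-approximation (Wald) interval for a proportion. The sketch below uses illustrative counts, not the evaluation's actual item totals, so the margin it prints will not match the leaderboard's.

```python
import math

def wald_ci_95(correct: int, total: int) -> tuple[float, float]:
    """Accuracy with an approximate 95% margin via the normal approximation."""
    p = correct / total
    margin = 1.96 * math.sqrt(p * (1 - p) / total)
    return p, margin

def statistically_tied(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two (accuracy, margin) intervals overlap."""
    return abs(a[0] - b[0]) <= a[1] + b[1]

# Illustrative counts only (not the real evaluation data):
acc, m = wald_ci_95(2793, 4000)
print(f"{acc:.1%} ±{m:.1%}")  # prints: 69.8% ±1.4%
```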
Why We Are Different
Benchmark Complexity

Available public and private reasoning benchmarks are comparatively simple and fail to fully evaluate causal capabilities. By contrast, Welo Data's benchmarks pair a novel story, which provides context to the model, with a variety of questions that test multiple causal relationships.

Competitor: Example 1

Event A: The sun came out.

Event B: John put his sunglasses on.

Question: Which event caused the other?

Option 1: Event A caused event B.

Option 2: Event B caused event A.

Evals by Welo Data Multilingual Reasoning: Example 1

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Evals by Welo Data Multilingual Reasoning: Example 2

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Research
Related Publications
Get in touch

Want to see how your model performs?

Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.

Contact Us →