Evals by Welo Data

Previously known as Model Assessment Suite

Evals by Welo Data measures how frontier and open-source LLMs perform across languages — because the real world is multilingual, and model performance varies significantly across languages and tasks. Our evaluations cover two dimensions: safety and reasoning.

We measure how often models produce unsafe responses to harmful prompts across English, medium-resource, and low-resource languages — and how that rate changes as language resource level decreases.

See the Safety Leaderboard →
9 models evaluated · 51 languages tested (EN, 15 medium-resource, 35 low-resource)

We’ve evaluated 51 models across 8 languages and 4 domains using human-validated, natively authored scenarios designed to surface genuine causal reasoning capabilities — not pattern-matching on seen data.

See the Reasoning Leaderboard →
8 languages (AR · DE · EN · ES · FR · JA · KO · TR) · 4 domains

Safety Leaderboard

How often do leading LLMs produce unsafe responses — and how does that rate change across language resource levels? This leaderboard measures unsafe response rates across English, medium-resource, and low-resource languages for leading multilingual models. Every model tested became less safe outside English. This benchmark exists to make that visible.

Leaderboard columns: # · Model · English unsafe rate (95% CI) · Medium-resource unsafe rate (95% CI) · Low-resource unsafe rate (95% CI) · Degradation (EN → low-resource)

The Degradation column shows the unsafe rate increase from English to low-resource languages. A large gap does not mean a model is the most dangerous overall — it depends on where the model starts. A model with strong English safety has more room to fall. The metrics that reflect real-world risk for multilingual deployments are the medium- and low-resource unsafe rates.
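To make the distinction concrete, here is a small sketch with hypothetical rates (not taken from the leaderboard):

```python
# Hypothetical illustration of why a large Degradation figure does not
# imply the highest absolute risk. These rates are invented, not
# leaderboard values.
model_a = {"en": 0.02, "low": 0.30}  # strong English safety, big fall
model_b = {"en": 0.20, "low": 0.36}  # weaker English safety, smaller fall

for name, r in (("model_a", model_a), ("model_b", model_b)):
    degradation = r["low"] - r["en"]
    print(f"{name}: degradation = {degradation:+.0%}, "
          f"low-resource unsafe rate = {r['low']:.0%}")

# model_a shows the larger degradation (+28 vs +16 points), but model_b
# is the riskier deployment in low-resource languages (36% vs 30%).
```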

How it works
01 — PROMPT SELECTION
Curated for maximum signal
50 harmful prompts drawn from the dataset underlying our 2026 white paper, Global Security Blind Spots: LLM Safety Failures in Low-Resource Languages. The original dataset was drawn from publicly available safety benchmarks including AdvBench, HarmBench, MultiJail, PHTest, and DiaSafety. This evaluation spans six categories: Dangerous Behavior & Criminal Content (20), Hate & Discrimination (10), Violence & Threats (10), Self-Harm & Suicide (5), Misinformation & Disinformation (4), and Model Security (1). Prompts were chosen to maximize the gap between English and low-resource unsafe rates. Because the prompts target genuinely harmful outputs, this benchmark tests safety alignment where it is most important — not where models are most likely to succeed.
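For reference, the category composition above sums to the 50-prompt total:

```python
# Category composition of the 50-prompt set, as listed above.
category_counts = {
    "Dangerous Behavior & Criminal Content": 20,
    "Hate & Discrimination": 10,
    "Violence & Threats": 10,
    "Self-Harm & Suicide": 5,
    "Misinformation & Disinformation": 4,
    "Model Security": 1,
}
assert sum(category_counts.values()) == 50
```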
02 — LANGUAGE TIERS
English, medium-resource, and low-resource
51 languages organized into three tiers. English serves as the alignment baseline. The 15 medium-resource languages are Amharic, Bengali, Georgian, Hausa, Hebrew, Indonesian, Kazakh, Malay, Romanian, Swahili, Tagalog, Tamil, Thai, Ukrainian, and Urdu — languages with meaningful web presence and hundreds of millions of combined speakers, but without English-level safety alignment training. The 35 low-resource languages were selected from an empirically evaluated pool of 78 languages, prioritizing those with the largest English-to-low-resource safety gaps while maintaining linguistic diversity. In total, 23 language families are represented.
03 — EVALUATION PIPELINE
Translate, collect, back-translate, judge
Prompts were translated from English into each target language using the Google Cloud Translation API. Model responses were collected in the target language, then back-translated into English using the same API. A consistent English-language safety judge assigned a binary safe/unsafe label to each back-translated response. This approach allows identical evaluation criteria to be applied across all models and all languages. Prompt translations were performed once and reused across models; back-translations were performed per model.
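As a rough sketch of those four stages, the Python below outlines the flow. The helper callables (translate, judge_is_unsafe) and the model.generate method are placeholders standing in for the Google Cloud Translation API, the model under test, and the English-language safety judge — not the actual implementation:

```python
# Minimal sketch of the translate -> collect -> back-translate -> judge
# pipeline. All helpers are placeholders, not the real implementation.
def evaluate_model(model, english_prompts, target_languages,
                   translate, judge_is_unsafe):
    """Return the number of unsafe responses per language for one model."""
    unsafe_counts = {}
    for lang in target_languages:
        unsafe = 0
        for prompt in english_prompts:
            # 1. Translate the harmful prompt into the target language.
            #    (In the real pipeline this is done once and reused
            #    across models.)
            localized = translate(prompt, source="en", target=lang)
            # 2. Collect the model's response in the target language.
            response = model.generate(localized)
            # 3. Back-translate the response into English (per model).
            back = translate(response, source=lang, target="en")
            # 4. Apply the same English-language safety judge everywhere.
            unsafe += judge_is_unsafe(prompt, back)
        unsafe_counts[lang] = unsafe
    return unsafe_counts
```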
04 — CONFIDENCE INTERVALS
What the ± values mean
All unsafe rates are reported with 95% confidence intervals. English confidence intervals are wider than medium- and low-resource intervals because the English sample is 50 prompt-response pairs per model, versus 750 for medium-resource and 1,750 for low-resource. Models with overlapping confidence intervals should be considered statistically tied.
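As an illustration of why the English intervals are wider, the sketch below uses a normal-approximation 95% interval with a hypothetical 20% unsafe rate; the leaderboard's exact interval method may differ:

```python
import math

def ci_half_width(p, n, z=1.96):
    """Approximate 95% CI half-width for a proportion
    (normal approximation); shown only to illustrate how sample
    size drives interval width."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.20  # hypothetical unsafe rate
for tier, n in [("English", 50), ("medium-resource", 750),
                ("low-resource", 1750)]:
    print(f"{tier:>16}: ±{ci_half_width(p, n):.1%} at n={n}")

# Output:
#          English: ±11.1% at n=50
#  medium-resource: ±2.9% at n=750
#     low-resource: ±1.9% at n=1750
```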
Multilingual Causal Reasoning Leaderboard
Overall Rankings
Model · Accuracy (95% CI)
gpt-5 · 69.8% ±0.7%
gpt-5.4 · 69.3% ±0.7%
o1-preview-2024-09-12 · 68.0% ±0.7%
gpt-4.5-preview · 67.6% ±0.7%
gemini-3.1-pro-preview · 66.4% ±0.7%
gemini-2.5-pro-preview-05-06 · 66.3% ±0.7%
claude-sonnet-4-6 · 65.7% ±0.7%
gpt-4-0125-preview · 65.7% ±0.7%
gpt-4o · 63.8% ±0.8%
mistral-large@2407 · 62.8% ±0.8%
mistral-large-2411@001 · 62.7% ±0.8%
grok-4-1-fast-reasoning · 62.5% ±0.8%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.9% ±0.8%
deepseek-r1 · 60.5% ±0.8%
grok-4-0709 · 60.5% ±0.8%
o1-mini · 60.3% ±0.8%
gpt-4o-mini · 59.2% ±0.8%
palmyra-x-004 · 59.2% ±0.8%
grok-beta · 59.2% ±0.8%
claude-3-7-sonnet@20250219 · 59.0% ±0.8%
phi-4 · 58.8% ±0.8%
amazon.nova-pro-v1:0 · 58.7% ±0.8%
llama-4-maverick-17b-128e-instruct-maas · 58.2% ±0.8%
claude-3-5-sonnet-v2@20241022 · 57.8% ±0.8%
llama-3.1-405b-instruct-maas · 57.6% ±0.8%
palmyra-x-003-instruct · 57.1% ±0.8%
claude-3-opus@20240229 · 56.7% ±0.8%
gemini-1.5-pro · 56.4% ±0.8%
palmyra-fin · 56.3% ±0.8%
mistral-small-2503@001 · 56.2% ±0.8%
gemini-2.0-flash-exp · 56.0% ±0.8%
llama-3.1-70b-instruct-maas · 55.9% ±0.8%
llama-3.2-90b-vision-instruct-maas · 55.7% ±0.8%
gpt-4 · 52.6% ±0.8%
claude-3-5-haiku@20241022 · 52.4% ±0.8%
gemini-1.5-flash · 51.3% ±0.8%
Qwen2.5-72b-instruct · 50.9% ±0.8%
claude-3-sonnet@20240229 · 50.4% ±0.8%
amazon.nova-lite-v1:0 · 49.7% ±0.8%
mistral-nemo-2407 · 47.5% ±0.8%
claude-3-haiku@20240307 · 47.4% ±0.8%
gemini-1.0-pro · 45.9% ±0.8%
gpt-3.5-turbo-0125 · 42.9% ±0.8%
mistral-large-2402-v1 · 40.5% ±0.8%
jamba-1.5-large@001 · 40.1% ±0.8%
llama-3.1-8b-instruct-maas · 39.2% ±0.8%
cohere.command-r-plus-v1:0 · 36.1% ±0.8%
gpt-3.5-turbo-instruct · 31.1% ±0.7%
phi-3.5-mini-instruct · 27.3% ±0.7%
palmyra-med · 23.5% ±0.7%
codellama-34b-instruct-hf · 18.0% ±0.6%
Language Leaderboards
Model · Accuracy (95% CI)
gpt-4.5-preview · 68.4% ±2.1%
gpt-5 · 65.9% ±2.1%
gemini-3.1-pro-preview · 65.8% ±2.1%
o1-preview-2024-09-12 · 64.9% ±2.1%
gpt-5.4 · 64.4% ±2.1%
gpt-4-0125-preview · 63.6% ±2.1%
gpt-4o · 63.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 62.8% ±2.1%
claude-sonnet-4-6 · 60.3% ±2.2%
grok-4-1-fast-reasoning · 60.3% ±2.2%
qwen.qwen3-235b-a22b-2507-v1:0 · 59.8% ±2.2%
codellama-34b-instruct-hf · 0.2% ±0.2%
Model · Accuracy (95% CI)
gpt-5 · 73.4% ±2.0%
gpt-5.4 · 71.9% ±2.0%
o1-preview-2024-09-12 · 70.0% ±2.0%
claude-sonnet-4-6 · 68.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.1% ±2.1%
mistral-large@2407 · 67.5% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
gpt-4.5-preview · 67.4% ±2.1%
gpt-4-0125-preview · 66.2% ±2.1%
mistral-large-2411@001 · 65.9% ±2.1%
codellama-34b-instruct-hf · 11.7% ±1.4%
Model · Accuracy (95% CI)
gpt-5.4 · 73.5% ±2.0%
gpt-5 · 71.5% ±2.0%
o1-preview-2024-09-12 · 71.0% ±2.0%
gpt-4.5-preview · 70.9% ±2.0%
claude-sonnet-4-6 · 67.1% ±2.1%
gpt-4o · 66.6% ±2.1%
gpt-4-0125-preview · 66.6% ±2.1%
grok-4-1-fast-reasoning · 65.3% ±2.1%
phi-4 · 65.1% ±2.1%
palmyra-x-004 · 65.1% ±2.1%
palmyra-med · 30.5% ±2.0%
Model · Accuracy (95% CI)
gpt-5 · 72.5% ±2.0%
gpt-4.5-preview · 71.6% ±2.0%
o1-preview-2024-09-12 · 71.5% ±2.0%
gpt-5.4 · 70.5% ±2.0%
claude-sonnet-4-6 · 70.0% ±2.0%
gemini-3.1-pro-preview · 68.7% ±2.1%
gpt-4-0125-preview · 68.2% ±2.1%
gpt-4o · 67.5% ±2.1%
grok-4-1-fast-reasoning · 67.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.2% ±2.1%
jamba-1.5-large@001 · 21.9% ±1.8%
Model · Accuracy (95% CI)
gpt-5 · 70.6% ±2.0%
gpt-5.4 · 68.8% ±2.0%
gpt-4-0125-preview · 68.3% ±2.1%
gpt-4.5-preview · 67.8% ±2.1%
mistral-large@2407 · 66.6% ±2.1%
gpt-4o · 66.5% ±2.1%
mistral-large-2411@001 · 65.7% ±2.1%
o1-preview-2024-09-12 · 65.6% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gemini-3.1-pro-preview · 64.0% ±2.1%
palmyra-med · 28.8% ±2.0%
Model · Accuracy (95% CI)
gpt-5.4 · 68.9% ±2.0%
gemini-2.5-pro-preview-05-06 · 68.4% ±2.1%
gemini-3.1-pro-preview · 67.8% ±2.1%
gpt-5 · 67.0% ±2.1%
o1-preview-2024-09-12 · 66.9% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gpt-4.5-preview · 63.8% ±2.1%
mistral-large@2407 · 63.5% ±2.1%
mistral-large-2411@001 · 63.5% ±2.1%
gpt-4-0125-preview · 63.3% ±2.1%
codellama-34b-instruct-hf · 13.3% ±1.5%
Model · Accuracy (95% CI)
gpt-5.4 · 69.2% ±2.0%
gpt-5 · 68.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.3% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
claude-sonnet-4-6 · 66.3% ±2.1%
o1-preview-2024-09-12 · 65.3% ±2.1%
gpt-4-0125-preview · 64.5% ±2.1%
gpt-4.5-preview · 62.7% ±2.1%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.4% ±2.2%
mistral-large@2407 · 60.5% ±2.2%
codellama-34b-instruct-hf · 3.2% ±0.8%
Model · Accuracy (95% CI)
gpt-5 · 69.3% ±2.0%
o1-preview-2024-09-12 · 69.0% ±2.0%
gpt-4.5-preview · 68.0% ±2.1%
gpt-5.4 · 67.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.4% ±2.1%
gemini-3.1-pro-preview · 67.1% ±2.1%
claude-sonnet-4-6 · 65.3% ±2.1%
gpt-4-0125-preview · 64.8% ±2.1%
grok-4-1-fast-reasoning · 62.4% ±2.1%
gpt-4o · 61.5% ±2.2%
codellama-34b-instruct-hf · 7.6% ±1.2%
Domain Leaderboards
Model · Accuracy (95% CI)
gpt-5.4 · 83.4% ±1.3%
gemini-3.1-pro-preview · 83.1% ±1.3%
claude-sonnet-4-6 · 83.1% ±1.3%
gpt-5 · 82.0% ±1.3%
grok-4-1-fast-reasoning · 81.4% ±1.4%
gemini-2.5-pro-preview-05-06 · 80.9% ±1.4%
gpt-4-0125-preview · 80.3% ±1.4%
o1-preview-2024-09-12 · 79.8% ±1.4%
gpt-4.5-preview · 78.4% ±1.4%
grok-4-0709 · 77.6% ±1.5%
codellama-34b-instruct-hf · 17.6% ±1.3%
Model · Accuracy (95% CI)
gpt-5 · 76.5% ±1.2%
o1-preview-2024-09-12 · 72.3% ±1.3%
gemini-2.5-pro-preview-05-06 · 71.8% ±1.3%
gemini-3.1-pro-preview · 70.9% ±1.3%
gpt-4-0125-preview · 70.9% ±1.3%
gpt-5.4 · 70.7% ±1.3%
gpt-4.5-preview · 70.2% ±1.3%
deepseek-r1 · 67.7% ±1.3%
o1-mini · 67.3% ±1.3%
claude-sonnet-4-6 · 67.1% ±1.3%
codellama-34b-instruct-hf · 16.9% ±1.1%
Model · Accuracy (95% CI)
gpt-5.4 · 76.0% ±1.3%
gpt-4.5-preview · 75.0% ±1.3%
o1-preview-2024-09-12 · 73.2% ±1.4%
gpt-4-0125-preview · 71.9% ±1.4%
gemini-3.1-pro-preview · 71.7% ±1.4%
claude-sonnet-4-6 · 71.2% ±1.4%
gpt-5 · 70.5% ±1.4%
gpt-4o · 69.9% ±1.4%
gemini-2.5-pro-preview-05-06 · 69.7% ±1.4%
mistral-large-2411@001 · 69.4% ±1.4%
phi-3.5-mini-instruct · 19.2% ±1.2%
Methodology
01 — DATASET DESIGN
Novel, Human-Authored Scenarios
The dataset consists of three components: fact-based scenarios, scenario-based narratives, and question & answer pairs. Domain experts generated novel, fact-based scenarios using terminology from their respective fields. Writers then used those scenarios to produce narratives from different character perspectives. Finally, experts in Cognitive Science, Philosophy, Linguistics, and NLP research generated Q&A pairs based on the causal events in each scenario. Because the stories and questions are entirely original, models cannot rely on memorized training data — their multilingual reasoning capabilities are genuinely tested.
02 — SCENARIO TYPES
Three Causation Categories
Each domain contains six scenarios divided across three types. Standard Causation depicts clear cause-and-effect relationships without norm violations. Normality Violation — Explicit introduces scenarios where at least one explicit norm is violated (a policy, rule, law, or regulation). Normality Violation — Implicit involves violations of informal, unwritten rules such as social norms or everyday conventions. Each scenario includes 9–14 questions depending on type.
03 — QUESTION CATEGORIES
What the Evaluation Measures
Models are evaluated on their ability to identify causal relationships, discern between a cause and a confounder, determine normality violations in a chain of causal events, and perform these tasks in the context of language variation. Question types include binary and multiple-choice formats across four categories: Causal Discovery (Cause), Causal Discovery (Confounder), Language Variation, and Normality Variation. Despite significant advances in LLMs, multilingual reasoning remains a difficult challenge.
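To make this structure concrete, here is an illustrative sketch of how a single evaluation item could be represented. The field names and types are assumptions for illustration, not Welo Data's actual schema:

```python
# Illustrative (not official) representation of one evaluation item,
# combining the scenario types and question categories described above.
from dataclasses import dataclass
from typing import Literal

ScenarioType = Literal[
    "standard_causation",
    "normality_violation_explicit",  # violated policy, rule, law, regulation
    "normality_violation_implicit",  # violated social norm or convention
]

QuestionCategory = Literal[
    "causal_discovery_cause",
    "causal_discovery_confounder",
    "language_variation",
    "normality_variation",
]

@dataclass
class EvalItem:
    domain: str                  # e.g. "Legal & Criminal Justice"
    language: str                # one of the 8 evaluated languages
    scenario_type: ScenarioType
    question_category: QuestionCategory
    story: str                   # novel, human-authored narrative
    question: str
    options: list[str]           # binary or multiple-choice answer set
    answer_index: int            # index of the correct option
```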
04 — TRANSLATION APPROACH
Cross-Lingual Consistency
Stories and questions were originally written in English and professionally translated into seven additional languages. Translators were instructed to avoid word-for-word translation and instead to prioritize retaining the original semantics while using natural word choices and grammatical structures appropriate to each language. This standardization enables comparison of the same model on the same scenario across languages, while preserving linguistic and cultural naturalness.
05 — DOMAINS
Four Subject Areas
The dataset covers four domains: Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General. Each domain includes scenarios across all three causation types, ensuring that results reflect reasoning capability rather than domain familiarity alone.
06 — STATISTICS
Confidence Intervals
All accuracy scores include approximate 95% confidence intervals. The ± values shown in each leaderboard represent the margin of error around each accuracy estimate, giving a range rather than just a point estimate. Models with overlapping confidence intervals should be considered statistically tied.
Benchmark Complexity

Available public and private reasoning benchmarks tend to be simple, single-step items that fail to fully evaluate causal capabilities. By contrast, Welo Data's benchmarks pair a novel story, which provides context to the model, with a variety of questions that test multiple causal relationships.

Competitor: Example 1

Event A: The sun came out.

Event B: John put his sunglasses on.

Question: Which event caused the other?

Option 1: Event A caused event B.

Option 2: Event B caused event A.

Evals by Welo Data Multilingual Reasoning: Example 1

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Evals by Welo Data Multilingual Reasoning: Example 2

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Get in touch

Want to see how your model performs?

Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.

Contact Us →