Evals by Welo Data
Previously known as Model Assessment Suite
Evals by Welo Data measures how frontier and open-source LLMs perform across languages — because the real world is multilingual, and model performance varies significantly across languages and tasks. Our evaluations cover two dimensions: safety and reasoning.
Safety Evaluation
We measure how often models produce unsafe responses to harmful prompts across English, medium-resource, and low-resource languages — and how that rate changes as language resource level decreases.
9 models evaluated · 51 languages tested
EN · 15 medium-resource · 35 low-resource
Multilingual Reasoning Evaluation
We’ve evaluated 51 models across 8 languages and 4 domains using human-validated, natively authored scenarios designed to surface genuine causal reasoning capabilities rather than pattern-matching on seen data.
8 languages · 4 domains
AR · DE · EN · ES · FR · JA · KO · TR
Safety Leaderboard
9 models · 51 languages · Unsafe response rate + 95% CI
How often do leading LLMs produce unsafe responses — and how does that rate change across language resource levels? This leaderboard measures unsafe response rates across English, medium-resource, and low-resource languages for leading multilingual models. Every model tested became less safe outside English. This benchmark exists to make that visible.
| # | Model | English unsafe rate (95% CI) | Medium-resource unsafe rate (95% CI) | Low-resource unsafe rate (95% CI) | Degradation (EN → low-resource) |
|---|---|---|---|---|---|
The Degradation column shows the unsafe rate increase from English to low-resource languages. The metrics that reflect real-world risk for multilingual deployments are the medium- and low-resource unsafe rates.
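As a rough illustration of how the Degradation column can be read, the sketch below treats it as a simple percentage-point difference between the low-resource and English unsafe rates. That interpretation is an assumption (the page does not say whether the increase is absolute or relative), and the function name and input values are hypothetical.

```python
def degradation(english_unsafe_rate: float, low_resource_unsafe_rate: float) -> float:
    """Percentage-point increase in unsafe rate from English to low-resource.

    Assumption: the increase is an absolute difference, not a ratio.
    """
    return low_resource_unsafe_rate - english_unsafe_rate


# Hypothetical rates for illustration only; not taken from the leaderboard.
example = degradation(english_unsafe_rate=4.0, low_resource_unsafe_rate=22.5)
print(f"Degradation: +{example:.1f} percentage points")
```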
How it works
01 — PROMPT SELECTION
Curated for maximum signal
50 harmful prompts drawn from the dataset underlying our 2026 white paper, Global Security Blind Spots: LLM Safety Failures in Low-Resource Languages. The original dataset was drawn from publicly available safety benchmarks including AdvBench, HarmBench, MultiJail, PHTest, and DiaSafety. This evaluation spans six categories: Dangerous Behavior & Criminal Content (20), Hate & Discrimination (10), Violence & Threats (10), Self-Harm & Suicide (5), Misinformation & Disinformation (4), and Model Security (1). Prompts were chosen to maximize the gap between English and low-resource unsafe rates. Because the prompts target genuinely harmful outputs, this benchmark tests safety alignment where it is most important — not where models are most likely to succeed.
02 — LANGUAGE TIERS
English, medium-resource, and low-resource
51 languages organized into three tiers. English serves as the alignment baseline. The 15 medium-resource languages include Amharic, Bengali, Georgian, Hausa, Hebrew, Indonesian, Kazakh, Malay, Romanian, Swahili, Tagalog, Tamil, Thai, Ukrainian, and Urdu — languages with meaningful web presence and hundreds of millions of speakers, but without English-level safety alignment training. The 35 low-resource languages are selected from an empirically evaluated pool of 78 languages, prioritizing those with the largest English-to-low-resource safety gaps while maintaining linguistic diversity. 23 language families are represented.
03 — EVALUATION PIPELINE
Translate, collect, back-translate, judge
Prompts were translated from English into each target language using the Google Cloud Translation API. Model responses were collected in the target language, then back-translated into English using the same API. A consistent English-language safety judge assigned a binary safe/unsafe label to each back-translated response. This approach allows identical evaluation criteria to be applied across all models and all languages. Prompt translations were performed once and reused across models; back-translations were performed per model.
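The four steps above can be sketched roughly as follows. This is a minimal outline, not Welo Data's implementation: the `translate`, `query_model`, and `judge` callables are hypothetical stand-ins for the translation API, the model under test, and the English-language safety judge, and the toy functions at the bottom exist only so the sketch runs end to end.

```python
from typing import Callable, Dict, List, Tuple

# All three callables are hypothetical stand-ins, not real client libraries.
Translate = Callable[[str, str, str], str]  # (text, source_lang, target_lang) -> text
QueryModel = Callable[[str], str]           # target-language prompt -> response
Judge = Callable[[str], bool]               # English text -> True if unsafe


def translate_prompts(prompts_en: List[str], languages: List[str],
                      translate: Translate) -> Dict[Tuple[str, str], str]:
    """Translate each prompt once per language; the result is reused for every model."""
    return {(lang, prompt): translate(prompt, "en", lang)
            for lang in languages for prompt in prompts_en}


def unsafe_rates(translated: Dict[Tuple[str, str], str], query_model: QueryModel,
                 translate: Translate, judge: Judge) -> Dict[str, float]:
    """Collect, back-translate, and judge responses for one model, per language."""
    unsafe: Dict[str, int] = {}
    total: Dict[str, int] = {}
    for (lang, _prompt_en), prompt_local in translated.items():
        response_local = query_model(prompt_local)
        response_en = translate(response_local, lang, "en")  # done per model
        total[lang] = total.get(lang, 0) + 1
        unsafe[lang] = unsafe.get(lang, 0) + int(judge(response_en))
    return {lang: unsafe[lang] / total[lang] for lang in total}


# Toy stand-ins so the sketch is runnable.
def fake_translate(text: str, source: str, target: str) -> str:
    return text if source == target else f"[{target}] {text}"


def fake_model(prompt: str) -> str:
    return "COMPLY" if prompt.startswith("[sw]") else "REFUSE"


def fake_judge(response_en: str) -> bool:
    return "COMPLY" in response_en


translated = translate_prompts(["p1", "p2"], ["en", "sw"], fake_translate)
rates = unsafe_rates(translated, fake_model, fake_translate, fake_judge)
print(rates)
```

Keeping prompt translation separate from the per-model loop mirrors the description above: every model sees the same translated prompts, while back-translation happens on each model's own responses.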
04 — CONFIDENCE INTERVALS
What the ± values mean
All unsafe rates are reported with 95% confidence intervals. English confidence intervals are wider than medium- and low-resource intervals because the English sample is 50 prompt-response pairs per model, versus 750 for medium-resource and 1,750 for low-resource. Models with overlapping confidence intervals should be considered statistically tied.
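To see why the English intervals are wider, the sketch below computes a 95% margin of error for each tier's sample size. It assumes the normal-approximation (Wald) interval for a proportion and a hypothetical 20% unsafe rate; the page does not state which interval method was used, so treat the formula as illustrative.

```python
import math

PROMPTS_PER_LANGUAGE = 50
LANGUAGES_PER_TIER = {"English": 1, "Medium-resource": 15, "Low-resource": 35}


def half_width_95(p: float, n: int) -> float:
    """Wald 95% margin of error for a proportion p observed over n samples.

    Assumption: the benchmark's actual interval method may differ.
    """
    return 1.96 * math.sqrt(p * (1 - p) / n)


HYPOTHETICAL_RATE = 0.20  # illustrative unsafe rate, not a leaderboard value

for tier, n_languages in LANGUAGES_PER_TIER.items():
    n = PROMPTS_PER_LANGUAGE * n_languages  # 50, 750, 1,750 pairs per model
    margin = 100 * half_width_95(HYPOTHETICAL_RATE, n)
    print(f"{tier}: n={n}, margin ±{margin:.1f} percentage points")
```

Under these assumptions the English margin comes out at roughly ±11 points, versus roughly ±3 and ±2 points for the medium- and low-resource tiers, because the margin shrinks with the square root of the sample size.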
GET IN TOUCH
Want to see how your model performs?
Work with Welo Data to benchmark your model across the languages and tasks that matter to your users.
Multilingual Causal Reasoning Leaderboard
51 models · Updated March 17, 2026 · Accuracy + 95% CI
| Model | Accuracy & Confidence Interval |
|---|---|
| gpt-5 | 69.8%±0.7% |
| gpt-5.4 | 69.3%±0.7% |
| o1-preview-2024-09-12 | 68.0%±0.7% |
| gpt-4.5-preview | 67.6%±0.7% |
| gemini-3.1-pro-preview | 66.4%±0.7% |
| gemini-2.5-pro-preview-05-06 | 66.3%±0.7% |
| claude-sonnet-4-6 | 65.7%±0.7% |
| gpt-4-0125-preview | 65.7%±0.7% |
| gpt-4o | 63.8%±0.8% |
| mistral-large@2407 | 62.8%±0.8% |
| mistral-large-2411@001 | 62.7%±0.8% |
| grok-4-1-fast-reasoning | 62.5%±0.8% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 61.9%±0.8% |
| deepseek-r1 | 60.5%±0.8% |
| grok-4-0709 | 60.5%±0.8% |
| o1-mini | 60.3%±0.8% |
| gpt-4o-mini | 59.2%±0.8% |
| palmyra-x-004 | 59.2%±0.8% |
| grok-beta | 59.2%±0.8% |
| claude-3-7-sonnet@20250219 | 59.0%±0.8% |
| phi-4 | 58.8%±0.8% |
| amazon.nova-pro-v1:0 | 58.7%±0.8% |
| llama-4-maverick-17b-128e-instruct-maas | 58.2%±0.8% |
| claude-3-5-sonnet-v2@20241022 | 57.8%±0.8% |
| llama-3.1-405b-instruct-maas | 57.6%±0.8% |
| palmyra-x-003-instruct | 57.1%±0.8% |
| claude-3-opus@20240229 | 56.7%±0.8% |
| gemini-1.5-pro | 56.4%±0.8% |
| palmyra-fin | 56.3%±0.8% |
| mistral-small-2503@001 | 56.2%±0.8% |
| gemini-2.0-flash-exp | 56.0%±0.8% |
| llama-3.1-70b-instruct-maas | 55.9%±0.8% |
| llama-3.2-90b-vision-instruct-maas | 55.7%±0.8% |
| gpt-4 | 52.6%±0.8% |
| claude-3-5-haiku@20241022 | 52.4%±0.8% |
| gemini-1.5-flash | 51.3%±0.8% |
| Qwen2.5-72b-instruct | 50.9%±0.8% |
| claude-3-sonnet@20240229 | 50.4%±0.8% |
| amazon.nova-lite-v1:0 | 49.7%±0.8% |
| mistral-nemo-2407 | 47.5%±0.8% |
| claude-3-haiku@20240307 | 47.4%±0.8% |
| gemini-1.0-pro | 45.9%±0.8% |
| gpt-3.5-turbo-0125 | 42.9%±0.8% |
| mistral-large-2402-v1 | 40.5%±0.8% |
| jamba-1.5-large@001 | 40.1%±0.8% |
| llama-3.1-8b-instruct-maas | 39.2%±0.8% |
| cohere.command-r-plus-v1:0 | 36.1%±0.8% |
| gpt-3.5-turbo-instruct | 31.1%±0.7% |
| phi-3.5-mini-instruct | 27.3%±0.7% |
| palmyra-med | 23.5%±0.7% |
| codellama-34b-instruct-hf | 18.0%±0.6% |
Language Leaderboards
8 languages · Multilingual reasoning · Ranked independently
| Model | Accuracy & CI |
|---|---|
| gpt-4.5-preview | 68.4%±2.1% |
| gpt-5 | 65.9%±2.1% |
| gemini-3.1-pro-preview | 65.8%±2.1% |
| o1-preview-2024-09-12 | 64.9%±2.1% |
| gpt-5.4 | 64.4%±2.1% |
| gpt-4-0125-preview | 63.6%±2.1% |
| gpt-4o | 63.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 62.8%±2.1% |
| claude-sonnet-4-6 | 60.3%±2.2% |
| grok-4-1-fast-reasoning | 60.3%±2.2% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 59.8%±2.2% |
| codellama-34b-instruct-hf | 0.2%±0.2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 73.4%±2% |
| gpt-5.4 | 71.9%±2% |
| o1-preview-2024-09-12 | 70.0%±2% |
| claude-sonnet-4-6 | 68.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 68.1%±2.1% |
| mistral-large@2407 | 67.5%±2.1% |
| gemini-3.1-pro-preview | 67.4%±2.1% |
| gpt-4.5-preview | 67.4%±2.1% |
| gpt-4-0125-preview | 66.2%±2.1% |
| mistral-large-2411@001 | 65.9%±2.1% |
| codellama-34b-instruct-hf | 11.7%±1.4% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 73.5%±2% |
| gpt-5 | 71.5%±2% |
| o1-preview-2024-09-12 | 71.0%±2% |
| gpt-4.5-preview | 70.9%±2% |
| claude-sonnet-4-6 | 67.1%±2.1% |
| gpt-4o | 66.6%±2.1% |
| gpt-4-0125-preview | 66.6%±2.1% |
| grok-4-1-fast-reasoning | 65.3%±2.1% |
| phi-4 | 65.1%±2.1% |
| palmyra-x-004 | 65.1%±2.1% |
| palmyra-med | 30.5%±2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 72.5%±2% |
| gpt-4.5-preview | 71.6%±2% |
| o1-preview-2024-09-12 | 71.5%±2% |
| gpt-5.4 | 70.5%±2% |
| claude-sonnet-4-6 | 70.0%±2% |
| gemini-3.1-pro-preview | 68.7%±2.1% |
| gpt-4-0125-preview | 68.2%±2.1% |
| gpt-4o | 67.5%±2.1% |
| grok-4-1-fast-reasoning | 67.3%±2.1% |
| gemini-2.5-pro-preview-05-06 | 67.2%±2.1% |
| jamba-1.5-large@001 | 21.9%±1.8% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 70.6%±2% |
| gpt-5.4 | 68.8%±2% |
| gpt-4-0125-preview | 68.3%±2.1% |
| gpt-4.5-preview | 67.8%±2.1% |
| mistral-large@2407 | 66.6%±2.1% |
| gpt-4o | 66.5%±2.1% |
| mistral-large-2411@001 | 65.7%±2.1% |
| o1-preview-2024-09-12 | 65.6%±2.1% |
| claude-sonnet-4-6 | 64.2%±2.1% |
| gemini-3.1-pro-preview | 64.0%±2.1% |
| palmyra-med | 28.8%±2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 68.9%±2% |
| gemini-2.5-pro-preview-05-06 | 68.4%±2.1% |
| gemini-3.1-pro-preview | 67.8%±2.1% |
| gpt-5 | 67.0%±2.1% |
| o1-preview-2024-09-12 | 66.9%±2.1% |
| claude-sonnet-4-6 | 64.2%±2.1% |
| gpt-4.5-preview | 63.8%±2.1% |
| mistral-large@2407 | 63.5%±2.1% |
| mistral-large-2411@001 | 63.5%±2.1% |
| gpt-4-0125-preview | 63.3%±2.1% |
| codellama-34b-instruct-hf | 13.3%±1.5% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 69.2%±2% |
| gpt-5 | 68.3%±2.1% |
| gemini-2.5-pro-preview-05-06 | 68.3%±2.1% |
| gemini-3.1-pro-preview | 67.4%±2.1% |
| claude-sonnet-4-6 | 66.3%±2.1% |
| o1-preview-2024-09-12 | 65.3%±2.1% |
| gpt-4-0125-preview | 64.5%±2.1% |
| gpt-4.5-preview | 62.7%±2.1% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 61.4%±2.2% |
| mistral-large@2407 | 60.5%±2.2% |
| codellama-34b-instruct-hf | 3.2%±0.8% |
Domain Leaderboards
4 domains · Multilingual reasoning · Ranked independently
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 83.4%±1.3% |
| gemini-3.1-pro-preview | 83.1%±1.3% |
| claude-sonnet-4-6 | 83.1%±1.3% |
| gpt-5 | 82.0%±1.3% |
| grok-4-1-fast-reasoning | 81.4%±1.4% |
| gemini-2.5-pro-preview-05-06 | 80.9%±1.4% |
| gpt-4-0125-preview | 80.3%±1.4% |
| o1-preview-2024-09-12 | 79.8%±1.4% |
| gpt-4.5-preview | 78.4%±1.4% |
| grok-4-0709 | 77.6%±1.5% |
| codellama-34b-instruct-hf | 17.6%±1.3% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 76.5%±1.2% |
| o1-preview-2024-09-12 | 72.3%±1.3% |
| gemini-2.5-pro-preview-05-06 | 71.8%±1.3% |
| gemini-3.1-pro-preview | 70.9%±1.3% |
| gpt-4-0125-preview | 70.9%±1.3% |
| gpt-5.4 | 70.7%±1.3% |
| gpt-4.5-preview | 70.2%±1.3% |
| deepseek-r1 | 67.7%±1.3% |
| o1-mini | 67.3%±1.3% |
| claude-sonnet-4-6 | 67.1%±1.3% |
| codellama-34b-instruct-hf | 16.9%±1.1% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 76.0%±1.3% |
| gpt-4.5-preview | 75.0%±1.3% |
| o1-preview-2024-09-12 | 73.2%±1.4% |
| gpt-4-0125-preview | 71.9%±1.4% |
| gemini-3.1-pro-preview | 71.7%±1.4% |
| claude-sonnet-4-6 | 71.2%±1.4% |
| gpt-5 | 70.5%±1.4% |
| gpt-4o | 69.9%±1.4% |
| gemini-2.5-pro-preview-05-06 | 69.7%±1.4% |
| mistral-large-2411@001 | 69.4%±1.4% |
| phi-3.5-mini-instruct | 19.2%±1.2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 50.3%±1.6% |
| gpt-5.4 | 48.7%±1.6% |
| gpt-4.5-preview | 47.3%±1.6% |
| o1-preview-2024-09-12 | 47.2%±1.6% |
| gpt-4o | 45.0%±1.6% |
| claude-sonnet-4-6 | 43.7%±1.6% |
| gemini-2.5-pro-preview-05-06 | 43.6%±1.6% |
| grok-beta | 43.0%±1.6% |
| mistral-large@2407 | 42.6%±1.6% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 42.3%±1.6% |
| codellama-34b-instruct-hf | 15.9%±1.2% |
Methodology
01 — DATASET DESIGN
Novel, Human-Authored Scenarios
The dataset consists of three components: fact-based scenarios, scenario-based narratives, and question & answer pairs. Domain experts generated novel, fact-based scenarios using terminology from their respective fields. Writers then used those scenarios to produce narratives from different character perspectives. Finally, experts in Cognitive Science, Philosophy, Linguistics, and NLP research generated Q&A pairs based on the causal events in each scenario. Because the stories and questions are entirely original, models cannot rely on memorized training data — their multilingual reasoning capabilities are genuinely tested.
02 — SCENARIO TYPES
Three Causation Categories
Each domain contains six scenarios divided across three types. Standard Causation depicts clear cause-and-effect relationships without norm violations. Normality Violation — Explicit introduces scenarios where at least one explicit norm is violated (a policy, rule, law, or regulation). Normality Violation — Implicit involves violations of informal, unwritten rules such as social norms or everyday conventions. Each scenario includes 9–14 questions depending on type.
03 — QUESTION CATEGORIES
What the Evaluation Measures
Models are evaluated on their ability to identify causal relationships, distinguish between a cause and a confounder, determine normality violations in a chain of causal events, and perform these tasks in the context of language variation. Question types include binary and multiple-choice formats across four categories: Causal Discovery (Cause), Causal Discovery (Confounder), Language Variation, and Normality Variation. Despite significant advances in LLMs, multilingual reasoning remains a difficult challenge.
04 — TRANSLATION APPROACH
Cross-Lingual Consistency
Stories and questions were originally written in English and professionally translated across seven additional languages. Translators were instructed to avoid word-for-word translation and instead prioritize retaining original semantics while using natural word choices and grammatical structures appropriate to each language. This standardization enables comparison of the same model on the same scenario across languages, while preserving linguistic and cultural naturalness.
05 — DOMAINS
Four Subject Areas
The dataset covers four domains: Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General. Each domain includes scenarios across all three causation types, ensuring that results reflect reasoning capability rather than domain familiarity alone.
06 — STATISTICS
Confidence Intervals
All accuracy scores include approximate 95% confidence intervals. The ± values shown in each leaderboard represent the margin of error around each accuracy estimate, giving a range rather than just a point estimate. Models with overlapping confidence intervals should be considered statistically tied.
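The "statistically tied" rule can be checked mechanically by testing whether two intervals overlap. The sketch below uses three rows from the overall Multilingual Causal Reasoning Leaderboard; the function names are hypothetical, and interval overlap is used only because that is the rule stated above.

```python
from typing import Tuple

Score = Tuple[float, float]  # (accuracy %, margin of error %)


def interval(score: Score) -> Tuple[float, float]:
    """Return the (low, high) bounds of a reported accuracy ± margin."""
    accuracy, margin = score
    return (accuracy - margin, accuracy + margin)


def statistically_tied(a: Score, b: Score) -> bool:
    """True when the two confidence intervals overlap."""
    a_low, a_high = interval(a)
    b_low, b_high = interval(b)
    return a_low <= b_high and b_low <= a_high


# Values from the overall leaderboard.
gpt_5 = (69.8, 0.7)
gpt_5_4 = (69.3, 0.7)
gpt_4o = (63.8, 0.8)

print(statistically_tied(gpt_5, gpt_5_4))  # intervals overlap
print(statistically_tied(gpt_5, gpt_4o))   # intervals do not overlap
```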
Benchmark Complexity
Many available public and private reasoning benchmarks are simple and do not fully evaluate causal capabilities. By contrast, Welo Data’s benchmarks give the model a novel story as context, followed by a variety of questions that test multiple causal relationships.
COMPETITOR: EXAMPLE 1
Event A: The sun came out.
Event B: John put his sunglasses on.
Question: Which event caused the other?
Option 1: Event A caused event B.
Option 2: Event B caused event A.
EVALS BY WELO DATA MULTILINGUAL REASONING: EXAMPLE 1
Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.
Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…
Question: What caused Yuki to be put on probation?
a. Sam was afraid of Fran.
b. Sam was tired from a late shift.
c. Sam dropped the hot beverage.
d. Yuki cleaned up Sam’s spill, forgetting about the oat milk.
e. Alex stepped in to help at the counter after Sam went home.
f. Jaime became ill at work.
g. The hospital staff determined Jaime had been food-poisoned.
h. The café was fined by the health department.
i. There are no causal relationships.
j. There is not enough information.
EVALS BY WELO DATA MULTILINGUAL REASONING: EXAMPLE 2
Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.
Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…
Question: What caused Yuki to be put on probation?
a. Sam was afraid of Fran.
b. Sam was tired from a late shift.
c. Sam dropped the hot beverage.
d. Yuki cleaned up Sam’s spill, forgetting about the oat milk.
e. Alex stepped in to help at the counter after Sam went home.
f. Jaime became ill at work.
g. The hospital staff determined Jaime had been food-poisoned.
h. The café was fined by the health department.
i. There are no causal relationships.
j. There is not enough information.
Related Publications
A Novel Multi-Select Framework for Evaluating Causal Reasoning in LLMs
Welo Data Research Labs
Diagnosing Performance Gaps in Causal Reasoning
Welo Data Research Labs
Global Security Blind Spots: LLM Safety Failures in Low-Resource Languages
Welo Data Research Labs