Evals by Welo Data

Previously known as Model Assessment Suite

Evals by Welo Data measures how frontier and open-source LLMs perform across languages — because the real world is multilingual, and model performance varies significantly across languages and tasks. Our evaluations cover two dimensions: safety and reasoning.

We measure how often models produce unsafe responses to harmful prompts across English, medium-resource, and low-resource languages — and how that rate changes as language resource level decreases.

See the Safety Leaderboard →
9 models evaluated · 51 languages tested (EN, 15 medium-resource, 35 low-resource)

We’ve evaluated 51 models across 8 languages and 4 domains using human-validated, natively authored scenarios designed to surface genuine causal reasoning capabilities — not pattern-matching on seen data.

See the Reasoning Leaderboard →
8 languages (AR · DE · EN · ES · FR · JA · KO · TR) · 4 domains

Safety Leaderboard

How often do leading LLMs produce unsafe responses — and how does that rate change across language resource levels? This leaderboard measures unsafe response rates across English, medium-resource, and low-resource languages for leading multilingual models. Every model tested became less safe outside English. This benchmark exists to make that visible.

Leaderboard columns: # · Model · English unsafe rate (95% CI) · Medium-resource unsafe rate (95% CI) · Low-resource unsafe rate (95% CI) · Degradation (EN → low-resource)

The Degradation column shows the unsafe rate increase from English to low-resource languages. A large gap does not mean a model is the most dangerous overall — it depends on where the model starts. A model with strong English safety has more room to fall. The metrics that reflect real-world risk for multilingual deployments are the medium- and low-resource unsafe rates.
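To make the distinction concrete, here is a small sketch with hypothetical rates (not taken from the leaderboard):

```python
# Hypothetical illustration of why a large Degradation figure does not
# imply the highest absolute risk. These rates are invented, not
# leaderboard values.
model_a = {"en": 0.02, "low": 0.30}  # strong English safety, big fall
model_b = {"en": 0.20, "low": 0.36}  # weaker English safety, smaller fall

for name, r in (("model_a", model_a), ("model_b", model_b)):
    degradation = r["low"] - r["en"]
    print(f"{name}: degradation = {degradation:+.0%}, "
          f"low-resource unsafe rate = {r['low']:.0%}")

# model_a shows the larger degradation (+28 vs +16 points), but model_b
# is the riskier deployment in low-resource languages (36% vs 30%).
```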

How it works
01 — PROMPT SELECTION
Curated for maximum signal
50 harmful prompts drawn from the dataset underlying our 2026 white paper, Global Security Blind Spots: LLM Safety Failures in Low-Resource Languages. The original dataset was drawn from publicly available safety benchmarks including AdvBench, HarmBench, MultiJail, PHTest, and DiaSafety. This evaluation spans six categories: Dangerous Behavior & Criminal Content (20), Hate & Discrimination (10), Violence & Threats (10), Self-Harm & Suicide (5), Misinformation & Disinformation (4), and Model Security (1). Prompts were chosen to maximize the gap between English and low-resource unsafe rates. Because the prompts target genuinely harmful outputs, this benchmark tests safety alignment where it is most important — not where models are most likely to succeed.
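For reference, the category composition above sums to the 50-prompt total:

```python
# Category composition of the 50-prompt set, as listed above.
category_counts = {
    "Dangerous Behavior & Criminal Content": 20,
    "Hate & Discrimination": 10,
    "Violence & Threats": 10,
    "Self-Harm & Suicide": 5,
    "Misinformation & Disinformation": 4,
    "Model Security": 1,
}
assert sum(category_counts.values()) == 50
```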
02 — LANGUAGE TIERS
English, medium-resource, and low-resource
51 languages organized into three tiers. English serves as the alignment baseline. The 15 medium-resource languages are Amharic, Bengali, Georgian, Hausa, Hebrew, Indonesian, Kazakh, Malay, Romanian, Swahili, Tagalog, Tamil, Thai, Ukrainian, and Urdu — languages with meaningful web presence and hundreds of millions of combined speakers, but without English-level safety alignment training. The 35 low-resource languages were selected from an empirically evaluated pool of 78 languages, prioritizing those with the largest English-to-low-resource safety gaps while maintaining linguistic diversity. In total, 23 language families are represented.
03 — EVALUATION PIPELINE
Translate, collect, back-translate, judge
Prompts were translated from English into each target language using the Google Cloud Translation API. Model responses were collected in the target language, then back-translated into English using the same API. A consistent English-language safety judge assigned a binary safe/unsafe label to each back-translated response. This approach allows identical evaluation criteria to be applied across all models and all languages. Prompt translations were performed once and reused across models; back-translations were performed per model.
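As a rough sketch of those four stages, the Python below outlines the flow. The helper callables (translate, judge_is_unsafe) and the model.generate method are placeholders standing in for the Google Cloud Translation API, the model under test, and the English-language safety judge — not the actual implementation:

```python
# Minimal sketch of the translate -> collect -> back-translate -> judge
# pipeline. All helpers are placeholders, not the real implementation.
def evaluate_model(model, english_prompts, target_languages,
                   translate, judge_is_unsafe):
    """Return the number of unsafe responses per language for one model."""
    unsafe_counts = {}
    for lang in target_languages:
        unsafe = 0
        for prompt in english_prompts:
            # 1. Translate the harmful prompt into the target language.
            #    (In the real pipeline this is done once and reused
            #    across models.)
            localized = translate(prompt, source="en", target=lang)
            # 2. Collect the model's response in the target language.
            response = model.generate(localized)
            # 3. Back-translate the response into English (per model).
            back = translate(response, source=lang, target="en")
            # 4. Apply the same English-language safety judge everywhere.
            unsafe += judge_is_unsafe(prompt, back)
        unsafe_counts[lang] = unsafe
    return unsafe_counts
```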
04 — CONFIDENCE INTERVALS
What the ± values mean
All unsafe rates are reported with 95% confidence intervals. English confidence intervals are wider than medium- and low-resource intervals because the English sample is 50 prompt-response pairs per model, versus 750 for medium-resource and 1,750 for low-resource. Models with overlapping confidence intervals should be considered statistically tied.
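As an illustration of why the English intervals are wider, the sketch below uses a normal-approximation 95% interval with a hypothetical 20% unsafe rate; the leaderboard's exact interval method may differ:

```python
import math

def ci_half_width(p, n, z=1.96):
    """Approximate 95% CI half-width for a proportion
    (normal approximation); shown only to illustrate how sample
    size drives interval width."""
    return z * math.sqrt(p * (1 - p) / n)

p = 0.20  # hypothetical unsafe rate
for tier, n in [("English", 50), ("medium-resource", 750),
                ("low-resource", 1750)]:
    print(f"{tier:>16}: ±{ci_half_width(p, n):.1%} at n={n}")

# Output:
#          English: ±11.1% at n=50
#  medium-resource: ±2.9% at n=750
#     low-resource: ±1.9% at n=1750
```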
Multilingual Causal Reasoning Leaderboard
Overall Rankings
Model · Accuracy (95% CI)
gpt-5 · 69.8% ±0.7%
gpt-5.4 · 69.3% ±0.7%
o1-preview-2024-09-12 · 68.0% ±0.7%
gpt-4.5-preview · 67.6% ±0.7%
gemini-3.1-pro-preview · 66.4% ±0.7%
gemini-2.5-pro-preview-05-06 · 66.3% ±0.7%
claude-sonnet-4-6 · 65.7% ±0.7%
gpt-4-0125-preview · 65.7% ±0.7%
gpt-4o · 63.8% ±0.8%
mistral-large@2407 · 62.8% ±0.8%
mistral-large-2411@001 · 62.7% ±0.8%
grok-4-1-fast-reasoning · 62.5% ±0.8%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.9% ±0.8%
deepseek-r1 · 60.5% ±0.8%
grok-4-0709 · 60.5% ±0.8%
o1-mini · 60.3% ±0.8%
gpt-4o-mini · 59.2% ±0.8%
palmyra-x-004 · 59.2% ±0.8%
grok-beta · 59.2% ±0.8%
claude-3-7-sonnet@20250219 · 59.0% ±0.8%
phi-4 · 58.8% ±0.8%
amazon.nova-pro-v1:0 · 58.7% ±0.8%
llama-4-maverick-17b-128e-instruct-maas · 58.2% ±0.8%
claude-3-5-sonnet-v2@20241022 · 57.8% ±0.8%
llama-3.1-405b-instruct-maas · 57.6% ±0.8%
palmyra-x-003-instruct · 57.1% ±0.8%
claude-3-opus@20240229 · 56.7% ±0.8%
gemini-1.5-pro · 56.4% ±0.8%
palmyra-fin · 56.3% ±0.8%
mistral-small-2503@001 · 56.2% ±0.8%
gemini-2.0-flash-exp · 56.0% ±0.8%
llama-3.1-70b-instruct-maas · 55.9% ±0.8%
llama-3.2-90b-vision-instruct-maas · 55.7% ±0.8%
gpt-4 · 52.6% ±0.8%
claude-3-5-haiku@20241022 · 52.4% ±0.8%
gemini-1.5-flash · 51.3% ±0.8%
Qwen2.5-72b-instruct · 50.9% ±0.8%
claude-3-sonnet@20240229 · 50.4% ±0.8%
amazon.nova-lite-v1:0 · 49.7% ±0.8%
mistral-nemo-2407 · 47.5% ±0.8%
claude-3-haiku@20240307 · 47.4% ±0.8%
gemini-1.0-pro · 45.9% ±0.8%
gpt-3.5-turbo-0125 · 42.9% ±0.8%
mistral-large-2402-v1 · 40.5% ±0.8%
jamba-1.5-large@001 · 40.1% ±0.8%
llama-3.1-8b-instruct-maas · 39.2% ±0.8%
cohere.command-r-plus-v1:0 · 36.1% ±0.8%
gpt-3.5-turbo-instruct · 31.1% ±0.7%
phi-3.5-mini-instruct · 27.3% ±0.7%
palmyra-med · 23.5% ±0.7%
codellama-34b-instruct-hf · 18.0% ±0.6%
Language Leaderboards
Model · Accuracy (95% CI)
gpt-4.5-preview · 68.4% ±2.1%
gpt-5 · 65.9% ±2.1%
gemini-3.1-pro-preview · 65.8% ±2.1%
o1-preview-2024-09-12 · 64.9% ±2.1%
gpt-5.4 · 64.4% ±2.1%
gpt-4-0125-preview · 63.6% ±2.1%
gpt-4o · 63.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 62.8% ±2.1%
claude-sonnet-4-6 · 60.3% ±2.2%
grok-4-1-fast-reasoning · 60.3% ±2.2%
qwen.qwen3-235b-a22b-2507-v1:0 · 59.8% ±2.2%
codellama-34b-instruct-hf · 0.2% ±0.2%
Model · Accuracy (95% CI)
gpt-5 · 73.4% ±2.0%
gpt-5.4 · 71.9% ±2.0%
o1-preview-2024-09-12 · 70.0% ±2.0%
claude-sonnet-4-6 · 68.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.1% ±2.1%
mistral-large@2407 · 67.5% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
gpt-4.5-preview · 67.4% ±2.1%
gpt-4-0125-preview · 66.2% ±2.1%
mistral-large-2411@001 · 65.9% ±2.1%
codellama-34b-instruct-hf · 11.7% ±1.4%
Model · Accuracy (95% CI)
gpt-5.4 · 73.5% ±2.0%
gpt-5 · 71.5% ±2.0%
o1-preview-2024-09-12 · 71.0% ±2.0%
gpt-4.5-preview · 70.9% ±2.0%
claude-sonnet-4-6 · 67.1% ±2.1%
gpt-4o · 66.6% ±2.1%
gpt-4-0125-preview · 66.6% ±2.1%
grok-4-1-fast-reasoning · 65.3% ±2.1%
phi-4 · 65.1% ±2.1%
palmyra-x-004 · 65.1% ±2.1%
palmyra-med · 30.5% ±2.0%
Model · Accuracy (95% CI)
gpt-5 · 72.5% ±2.0%
gpt-4.5-preview · 71.6% ±2.0%
o1-preview-2024-09-12 · 71.5% ±2.0%
gpt-5.4 · 70.5% ±2.0%
claude-sonnet-4-6 · 70.0% ±2.0%
gemini-3.1-pro-preview · 68.7% ±2.1%
gpt-4-0125-preview · 68.2% ±2.1%
gpt-4o · 67.5% ±2.1%
grok-4-1-fast-reasoning · 67.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.2% ±2.1%
jamba-1.5-large@001 · 21.9% ±1.8%
Model · Accuracy (95% CI)
gpt-5 · 70.6% ±2.0%
gpt-5.4 · 68.8% ±2.0%
gpt-4-0125-preview · 68.3% ±2.1%
gpt-4.5-preview · 67.8% ±2.1%
mistral-large@2407 · 66.6% ±2.1%
gpt-4o · 66.5% ±2.1%
mistral-large-2411@001 · 65.7% ±2.1%
o1-preview-2024-09-12 · 65.6% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gemini-3.1-pro-preview · 64.0% ±2.1%
palmyra-med · 28.8% ±2.0%
Model · Accuracy (95% CI)
gpt-5.4 · 68.9% ±2.0%
gemini-2.5-pro-preview-05-06 · 68.4% ±2.1%
gemini-3.1-pro-preview · 67.8% ±2.1%
gpt-5 · 67.0% ±2.1%
o1-preview-2024-09-12 · 66.9% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gpt-4.5-preview · 63.8% ±2.1%
mistral-large@2407 · 63.5% ±2.1%
mistral-large-2411@001 · 63.5% ±2.1%
gpt-4-0125-preview · 63.3% ±2.1%
codellama-34b-instruct-hf · 13.3% ±1.5%
Model · Accuracy (95% CI)
gpt-5.4 · 69.2% ±2.0%
gpt-5 · 68.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.3% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
claude-sonnet-4-6 · 66.3% ±2.1%
o1-preview-2024-09-12 · 65.3% ±2.1%
gpt-4-0125-preview · 64.5% ±2.1%
gpt-4.5-preview · 62.7% ±2.1%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.4% ±2.2%
mistral-large@2407 · 60.5% ±2.2%
codellama-34b-instruct-hf · 3.2% ±0.8%
Model · Accuracy (95% CI)
gpt-5 · 69.3% ±2.0%
o1-preview-2024-09-12 · 69.0% ±2.0%
gpt-4.5-preview · 68.0% ±2.1%
gpt-5.4 · 67.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.4% ±2.1%
gemini-3.1-pro-preview · 67.1% ±2.1%
claude-sonnet-4-6 · 65.3% ±2.1%
gpt-4-0125-preview · 64.8% ±2.1%
grok-4-1-fast-reasoning · 62.4% ±2.1%
gpt-4o · 61.5% ±2.2%
codellama-34b-instruct-hf · 7.6% ±1.2%
Domain Leaderboards
Model · Accuracy (95% CI)
gpt-5.4 · 83.4% ±1.3%
gemini-3.1-pro-preview · 83.1% ±1.3%
claude-sonnet-4-6 · 83.1% ±1.3%
gpt-5 · 82.0% ±1.3%
grok-4-1-fast-reasoning · 81.4% ±1.4%
gemini-2.5-pro-preview-05-06 · 80.9% ±1.4%
gpt-4-0125-preview · 80.3% ±1.4%
o1-preview-2024-09-12 · 79.8% ±1.4%
gpt-4.5-preview · 78.4% ±1.4%
grok-4-0709 · 77.6% ±1.5%
codellama-34b-instruct-hf · 17.6% ±1.3%
Model · Accuracy (95% CI)
gpt-5 · 76.5% ±1.2%
o1-preview-2024-09-12 · 72.3% ±1.3%
gemini-2.5-pro-preview-05-06 · 71.8% ±1.3%
gemini-3.1-pro-preview · 70.9% ±1.3%
gpt-4-0125-preview · 70.9% ±1.3%
gpt-5.4 · 70.7% ±1.3%
gpt-4.5-preview · 70.2% ±1.3%
deepseek-r1 · 67.7% ±1.3%
o1-mini · 67.3% ±1.3%
claude-sonnet-4-6 · 67.1% ±1.3%
codellama-34b-instruct-hf · 16.9% ±1.1%
Model · Accuracy (95% CI)
gpt-5.4 · 76.0% ±1.3%
gpt-4.5-preview · 75.0% ±1.3%
o1-preview-2024-09-12 · 73.2% ±1.4%
gpt-4-0125-preview · 71.9% ±1.4%
gemini-3.1-pro-preview · 71.7% ±1.4%
claude-sonnet-4-6 · 71.2% ±1.4%
gpt-5 · 70.5% ±1.4%
gpt-4o · 69.9% ±1.4%
gemini-2.5-pro-preview-05-06 · 69.7% ±1.4%
mistral-large-2411@001 · 69.4% ±1.4%
phi-3.5-mini-instruct · 19.2% ±1.2%
Methodology
01 — DATASET DESIGN
Novel, Human-Authored Scenarios
The dataset consists of three components: fact-based scenarios, scenario-based narratives, and question & answer pairs. Domain experts generated novel, fact-based scenarios using terminology from their respective fields. Writers then used those scenarios to produce narratives from different character perspectives. Finally, experts in Cognitive Science, Philosophy, Linguistics, and NLP research generated Q&A pairs based on the causal events in each scenario. Because the stories and questions are entirely original, models cannot rely on memorized training data — their multilingual reasoning capabilities are genuinely tested.
02 — SCENARIO TYPES
Three Causation Categories
Each domain contains six scenarios divided across three types. Standard Causation depicts clear cause-and-effect relationships without norm violations. Normality Violation — Explicit introduces scenarios where at least one explicit norm is violated (a policy, rule, law, or regulation). Normality Violation — Implicit involves violations of informal, unwritten rules such as social norms or everyday conventions. Each scenario includes 9–14 questions depending on type.
03 — QUESTION CATEGORIES
What the Evaluation Measures
Models are evaluated on their ability to identify causal relationships, discern between a cause and a confounder, determine normality violations in a chain of causal events, and perform these tasks in the context of language variation. Question types include binary and multiple-choice formats across four categories: Causal Discovery (Cause), Causal Discovery (Confounder), Language Variation, and Normality Variation. Despite significant advances in LLMs, multilingual reasoning remains a difficult challenge.
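To make this structure concrete, here is an illustrative sketch of how a single evaluation item could be represented. The field names and types are assumptions for illustration, not Welo Data's actual schema:

```python
# Illustrative (not official) representation of one evaluation item,
# combining the scenario types and question categories described above.
from dataclasses import dataclass
from typing import Literal

ScenarioType = Literal[
    "standard_causation",
    "normality_violation_explicit",  # violated policy, rule, law, regulation
    "normality_violation_implicit",  # violated social norm or convention
]

QuestionCategory = Literal[
    "causal_discovery_cause",
    "causal_discovery_confounder",
    "language_variation",
    "normality_variation",
]

@dataclass
class EvalItem:
    domain: str                  # e.g. "Legal & Criminal Justice"
    language: str                # one of the 8 evaluated languages
    scenario_type: ScenarioType
    question_category: QuestionCategory
    story: str                   # novel, human-authored narrative
    question: str
    options: list[str]           # binary or multiple-choice answer set
    answer_index: int            # index of the correct option
```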
04 — TRANSLATION APPROACH
Cross-Lingual Consistency
Stories and questions were originally written in English and professionally translated into seven additional languages. Translators were instructed to avoid word-for-word translation and instead to prioritize retaining the original semantics while using natural word choices and grammatical structures appropriate to each language. This standardization enables comparison of the same model on the same scenario across languages, while preserving linguistic and cultural naturalness.
05 — DOMAINS
Four Subject Areas
The dataset covers four domains: Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General. Each domain includes scenarios across all three causation types, ensuring that results reflect reasoning capability rather than domain familiarity alone.
06 — STATISTICS
Confidence Intervals
All accuracy scores include approximate 95% confidence intervals. The ± values shown in each leaderboard represent the margin of error around each accuracy estimate, giving a range rather than just a point estimate. Models with overlapping confidence intervals should be considered statistically tied.
Benchmark Complexity

Available public and private reasoning benchmarks tend to be simple, single-step items that fail to fully evaluate causal capabilities. By contrast, Welo Data's benchmarks pair a novel story, which provides context to the model, with a variety of questions that test multiple causal relationships.

Competitor: Example 1

Event A: The sun came out.

Event B: John put his sunglasses on.

Question: Which event caused the other?

Option 1: Event A caused event B.

Option 2: Event B caused event A.

Evals by Welo Data Multilingual Reasoning: Example 1

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Evals by Welo Data Multilingual Reasoning: Example 2

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Get in touch

Want to see how your model performs?

Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.

Contact Us →