Evals by Welo Data

Evals by Welo Data measures how frontier and open-source LLMs perform across languages — because the real world is multilingual, and model performance varies significantly across languages and tasks. Our evaluations cover two dimensions: safety and reasoning.


Safety Evaluation

We measure how often models produce unsafe responses to harmful prompts across English, medium-resource, and low-resource languages — and how that rate changes as language resource level decreases.


Multilingual Reasoning Evaluation

We’ve evaluated 51 models across 8 languages and 4 domains using human-validated, natively authored scenarios designed to surface genuine causal reasoning rather than pattern matching on previously seen data.

Safety Leaderboard


How often do leading LLMs produce unsafe responses — and how does that rate change across language resource levels? This leaderboard measures unsafe response rates across English, medium-resource, and low-resource languages for leading multilingual models. Every model tested became less safe outside English. This benchmark exists to make that visible.

# | Model | English unsafe rate (95% CI) | Medium-resource unsafe rate (95% CI) | Low-resource unsafe rate (95% CI) | Degradation (EN → low-resource)

The Degradation column shows the increase in unsafe rate from English to low-resource languages. For multilingual deployments, the medium- and low-resource unsafe rates are the metrics that reflect real-world risk.
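As a rough illustration of how an unsafe rate, its 95% confidence interval, and the Degradation figure could be derived from raw judgments, here is a minimal Python sketch. The page does not state which interval method is used or whether Degradation is reported in percentage points, so the Wilson score interval, the percentage-point unit, the function names, and the counts below are all assumptions made for illustration.

```python
from math import sqrt


def wilson_interval(unsafe: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an unsafe-response rate (z = 1.96 is roughly 95%).

    Assumption: the leaderboard's actual interval method is not documented here.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = unsafe / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def degradation_points(english_rate: float, low_resource_rate: float) -> float:
    """Increase in unsafe rate from English to low-resource, in percentage points (assumed unit)."""
    return (low_resource_rate - english_rate) * 100


# Hypothetical counts for illustration only; these are not leaderboard results.
en_unsafe, en_total = 10, 200
low_unsafe, low_total = 40, 200

en_rate = en_unsafe / en_total
low_rate = low_unsafe / low_total
en_lo, en_hi = wilson_interval(en_unsafe, en_total)
low_lo, low_hi = wilson_interval(low_unsafe, low_total)

print(f"English:      {en_rate:.1%} (95% CI {en_lo:.1%} to {en_hi:.1%})")
print(f"Low-resource: {low_rate:.1%} (95% CI {low_lo:.1%} to {low_hi:.1%})")
print(f"Degradation:  {degradation_points(en_rate, low_rate):+.1f} points")
```

The Wilson interval is chosen here because it stays within 0 to 1 even when the unsafe rate is close to zero, which is common for English prompts.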

How it works


Curated for maximum signal

English, medium-resource, and low-resource

Translate, collect, back-translate, judge

What the ± values mean

Multilingual Causal Reasoning Leaderboard


Overall Rankings
Model | Accuracy ± Confidence Interval
gpt-5 | 69.8% ± 0.7%
gpt-5.4 | 69.3% ± 0.7%
o1-preview-2024-09-12 | 68.0% ± 0.7%
gpt-4.5-preview | 67.6% ± 0.7%
gemini-3.1-pro-preview | 66.4% ± 0.7%
gemini-2.5-pro-preview-05-06 | 66.3% ± 0.7%
claude-sonnet-4-6 | 65.7% ± 0.7%
gpt-4-0125-preview | 65.7% ± 0.7%
gpt-4o | 63.8% ± 0.8%
mistral-large@2407 | 62.8% ± 0.8%
mistral-large-2411@001 | 62.7% ± 0.8%
grok-4-1-fast-reasoning | 62.5% ± 0.8%
qwen.qwen3-235b-a22b-2507-v1:0 | 61.9% ± 0.8%
deepseek-r1 | 60.5% ± 0.8%
grok-4-0709 | 60.5% ± 0.8%
o1-mini | 60.3% ± 0.8%
gpt-4o-mini | 59.2% ± 0.8%
palmyra-x-004 | 59.2% ± 0.8%
grok-beta | 59.2% ± 0.8%
claude-3-7-sonnet@20250219 | 59.0% ± 0.8%
phi-4 | 58.8% ± 0.8%
amazon.nova-pro-v1:0 | 58.7% ± 0.8%
llama-4-maverick-17b-128e-instruct-maas | 58.2% ± 0.8%
claude-3-5-sonnet-v2@20241022 | 57.8% ± 0.8%
llama-3.1-405b-instruct-maas | 57.6% ± 0.8%
palmyra-x-003-instruct | 57.1% ± 0.8%
claude-3-opus@20240229 | 56.7% ± 0.8%
gemini-1.5-pro | 56.4% ± 0.8%
palmyra-fin | 56.3% ± 0.8%
mistral-small-2503@001 | 56.2% ± 0.8%
gemini-2.0-flash-exp | 56.0% ± 0.8%
llama-3.1-70b-instruct-maas | 55.9% ± 0.8%
llama-3.2-90b-vision-instruct-maas | 55.7% ± 0.8%
gpt-4 | 52.6% ± 0.8%
claude-3-5-haiku@20241022 | 52.4% ± 0.8%
gemini-1.5-flash | 51.3% ± 0.8%
Qwen2.5-72b-instruct | 50.9% ± 0.8%
claude-3-sonnet@20240229 | 50.4% ± 0.8%
amazon.nova-lite-v1:0 | 49.7% ± 0.8%
mistral-nemo-2407 | 47.5% ± 0.8%
claude-3-haiku@20240307 | 47.4% ± 0.8%
gemini-1.0-pro | 45.9% ± 0.8%
gpt-3.5-turbo-0125 | 42.9% ± 0.8%
mistral-large-2402-v1 | 40.5% ± 0.8%
jamba-1.5-large@001 | 40.1% ± 0.8%
llama-3.1-8b-instruct-maas | 39.2% ± 0.8%
cohere.command-r-plus-v1:0 | 36.1% ± 0.8%
gpt-3.5-turbo-instruct | 31.1% ± 0.7%
phi-3.5-mini-instruct | 27.3% ± 0.7%
palmyra-med | 23.5% ± 0.7%
codellama-34b-instruct-hf | 18.0% ± 0.6%
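To help read the ± figures in the rankings above, here is a small Python sketch of one common way such half-widths arise and how they can be used to judge whether two models are distinguishable. The page does not state the confidence level or interval method for this leaderboard, so the 95% normal-approximation formula is an assumption, as are the function names; the accuracy and ± values in the usage lines are taken from the table.

```python
from math import sqrt


def ci_half_width(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Normal-approximation half-width for an accuracy in [0, 1].

    Assumption: z = 1.96 (about 95%); the leaderboard's real method may differ.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return z * sqrt(accuracy * (1 - accuracy) / n_items)


def intervals_overlap(acc_a: float, hw_a: float, acc_b: float, hw_b: float) -> bool:
    """True when the two intervals [acc - hw, acc + hw] share at least one point."""
    return abs(acc_a - acc_b) <= hw_a + hw_b


# Values below are in percent, copied from the Overall Rankings table.
print(intervals_overlap(69.8, 0.7, 69.3, 0.7))  # gpt-5 vs gpt-5.4
print(intervals_overlap(69.8, 0.7, 63.8, 0.8))  # gpt-5 vs gpt-4o

# Hypothetical item count, only to show how the half-width shrinks with more items.
print(f"{ci_half_width(0.5, 10_000):.4f}")
```

Overlapping intervals are only a rough heuristic for "too close to call"; a paired comparison on the same items would be a stricter check.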

Language Leaderboards


Model | Accuracy ± CI
gpt-4.5-preview | 68.4% ± 2.1%
gpt-5 | 65.9% ± 2.1%
gemini-3.1-pro-preview | 65.8% ± 2.1%
o1-preview-2024-09-12 | 64.9% ± 2.1%
gpt-5.4 | 64.4% ± 2.1%
gpt-4-0125-preview | 63.6% ± 2.1%
gpt-4o | 63.5% ± 2.1%
gemini-2.5-pro-preview-05-06 | 62.8% ± 2.1%
claude-sonnet-4-6 | 60.3% ± 2.2%
grok-4-1-fast-reasoning | 60.3% ± 2.2%
qwen.qwen3-235b-a22b-2507-v1:0 | 59.8% ± 2.2%
codellama-34b-instruct-hf | 0.2% ± 0.2%
Model | Accuracy ± CI
gpt-5 | 73.4% ± 2%
gpt-5.4 | 71.9% ± 2%
o1-preview-2024-09-12 | 70.0% ± 2%
claude-sonnet-4-6 | 68.5% ± 2.1%
gemini-2.5-pro-preview-05-06 | 68.1% ± 2.1%
mistral-large@2407 | 67.5% ± 2.1%
gemini-3.1-pro-preview | 67.4% ± 2.1%
gpt-4.5-preview | 67.4% ± 2.1%
gpt-4-0125-preview | 66.2% ± 2.1%
mistral-large-2411@001 | 65.9% ± 2.1%
codellama-34b-instruct-hf | 11.7% ± 1.4%
Model | Accuracy ± CI
gpt-5.4 | 73.5% ± 2%
gpt-5 | 71.5% ± 2%
o1-preview-2024-09-12 | 71.0% ± 2%
gpt-4.5-preview | 70.9% ± 2%
claude-sonnet-4-6 | 67.1% ± 2.1%
gpt-4o | 66.6% ± 2.1%
gpt-4-0125-preview | 66.6% ± 2.1%
grok-4-1-fast-reasoning | 65.3% ± 2.1%
phi-4 | 65.1% ± 2.1%
palmyra-x-004 | 65.1% ± 2.1%
palmyra-med | 30.5% ± 2%
Model | Accuracy ± CI
gpt-5 | 72.5% ± 2%
gpt-4.5-preview | 71.6% ± 2%
o1-preview-2024-09-12 | 71.5% ± 2%
gpt-5.4 | 70.5% ± 2%
claude-sonnet-4-6 | 70.0% ± 2%
gemini-3.1-pro-preview | 68.7% ± 2.1%
gpt-4-0125-preview | 68.2% ± 2.1%
gpt-4o | 67.5% ± 2.1%
grok-4-1-fast-reasoning | 67.3% ± 2.1%
gemini-2.5-pro-preview-05-06 | 67.2% ± 2.1%
jamba-1.5-large@001 | 21.9% ± 1.8%
Model | Accuracy ± CI
gpt-5 | 70.6% ± 2%
gpt-5.4 | 68.8% ± 2%
gpt-4-0125-preview | 68.3% ± 2.1%
gpt-4.5-preview | 67.8% ± 2.1%
mistral-large@2407 | 66.6% ± 2.1%
gpt-4o | 66.5% ± 2.1%
mistral-large-2411@001 | 65.7% ± 2.1%
o1-preview-2024-09-12 | 65.6% ± 2.1%
claude-sonnet-4-6 | 64.2% ± 2.1%
gemini-3.1-pro-preview | 64.0% ± 2.1%
palmyra-med | 28.8% ± 2%
Model | Accuracy ± CI
gpt-5.4 | 68.9% ± 2%
gemini-2.5-pro-preview-05-06 | 68.4% ± 2.1%
gemini-3.1-pro-preview | 67.8% ± 2.1%
gpt-5 | 67.0% ± 2.1%
o1-preview-2024-09-12 | 66.9% ± 2.1%
claude-sonnet-4-6 | 64.2% ± 2.1%
gpt-4.5-preview | 63.8% ± 2.1%
mistral-large@2407 | 63.5% ± 2.1%
mistral-large-2411@001 | 63.5% ± 2.1%
gpt-4-0125-preview | 63.3% ± 2.1%
codellama-34b-instruct-hf | 13.3% ± 1.5%
Model | Accuracy ± CI
gpt-5.4 | 69.2% ± 2%
gpt-5 | 68.3% ± 2.1%
gemini-2.5-pro-preview-05-06 | 68.3% ± 2.1%
gemini-3.1-pro-preview | 67.4% ± 2.1%
claude-sonnet-4-6 | 66.3% ± 2.1%
o1-preview-2024-09-12 | 65.3% ± 2.1%
gpt-4-0125-preview | 64.5% ± 2.1%
gpt-4.5-preview | 62.7% ± 2.1%
qwen.qwen3-235b-a22b-2507-v1:0 | 61.4% ± 2.2%
mistral-large@2407 | 60.5% ± 2.2%
codellama-34b-instruct-hf | 3.2% ± 0.8%

Domain Leaderboards


Model | Accuracy ± CI
gpt-5.4 | 83.4% ± 1.3%
gemini-3.1-pro-preview | 83.1% ± 1.3%
claude-sonnet-4-6 | 83.1% ± 1.3%
gpt-5 | 82.0% ± 1.3%
grok-4-1-fast-reasoning | 81.4% ± 1.4%
gemini-2.5-pro-preview-05-06 | 80.9% ± 1.4%
gpt-4-0125-preview | 80.3% ± 1.4%
o1-preview-2024-09-12 | 79.8% ± 1.4%
gpt-4.5-preview | 78.4% ± 1.4%
grok-4-0709 | 77.6% ± 1.5%
codellama-34b-instruct-hf | 17.6% ± 1.3%
Model | Accuracy ± CI
gpt-5 | 76.5% ± 1.2%
o1-preview-2024-09-12 | 72.3% ± 1.3%
gemini-2.5-pro-preview-05-06 | 71.8% ± 1.3%
gemini-3.1-pro-preview | 70.9% ± 1.3%
gpt-4-0125-preview | 70.9% ± 1.3%
gpt-5.4 | 70.7% ± 1.3%
gpt-4.5-preview | 70.2% ± 1.3%
deepseek-r1 | 67.7% ± 1.3%
o1-mini | 67.3% ± 1.3%
claude-sonnet-4-6 | 67.1% ± 1.3%
codellama-34b-instruct-hf | 16.9% ± 1.1%
Model | Accuracy ± CI
gpt-5.4 | 76.0% ± 1.3%
gpt-4.5-preview | 75.0% ± 1.3%
o1-preview-2024-09-12 | 73.2% ± 1.4%
gpt-4-0125-preview | 71.9% ± 1.4%
gemini-3.1-pro-preview | 71.7% ± 1.4%
claude-sonnet-4-6 | 71.2% ± 1.4%
gpt-5 | 70.5% ± 1.4%
gpt-4o | 69.9% ± 1.4%
gemini-2.5-pro-preview-05-06 | 69.7% ± 1.4%
mistral-large-2411@001 | 69.4% ± 1.4%
phi-3.5-mini-instruct | 19.2% ± 1.2%

Methodology


Novel, Human-Authored Scenarios

Three Causation Categories

What the Evaluation Measures

Cross-Lingual Consistency

Four Subject Areas

Confidence Intervals

Benchmark Complexity


Most available public and private reasoning benchmarks are too simple to fully evaluate causal reasoning. By contrast, each Welo Data benchmark item presents a novel story that gives the model context, followed by a set of questions that test multiple causal relationships within that story.

Question: Which event caused the other?

Want to see how your model performs?

Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.