Evals by Welo Data
Previously known as Model Assessment Suite
Evals by Welo Data measures how frontier and open-source LLMs perform across languages. The real world is multilingual, and model performance varies significantly by language and task. Our evaluations cover two dimensions: safety and reasoning.
Safety Evaluation
We measure how often models produce unsafe responses to harmful prompts across English, medium-resource, and low-resource languages — and how that rate changes as language resource level decreases.
See the Safety Leaderboard →
Multilingual Reasoning Evaluation
We’ve evaluated 51 models across 8 languages and 4 domains using human-validated, natively-authored scenarios designed to surface genuine causal reasoning capabilities — not pattern-matching on seen data.
See the Reasoning Leaderboard →
Safety Leaderboard
How often do leading LLMs produce unsafe responses, and how does that rate change across language resource levels? This leaderboard measures unsafe response rates across English, medium-resource, and low-resource languages for leading multilingual models. Every model tested became less safe outside English. This benchmark exists to make that visible.
| # | Model | English unsafe rate (95% CI) | Medium-resource unsafe rate (95% CI) | Low-resource unsafe rate (95% CI) | Degradation EN → low-resource |
|---|---|---|---|---|---|
The Degradation column shows the unsafe rate increase from English to low-resource languages. A large gap does not mean a model is the most dangerous overall — it depends on where the model starts. A model with strong English safety has more room to fall. The metrics that reflect real-world risk for multilingual deployments are the medium and low-resource unsafe rates.
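The metrics above can be made concrete with a short sketch. The counts below are hypothetical (the source reports only rates), and the Wilson score interval is one common choice for a proportion's 95% CI; the source does not state which interval it uses:

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for a proportion (k unsafe responses out of n prompts)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical counts per language tier: (unsafe responses, prompts attempted).
tiers = {"english": (12, 1000), "medium_resource": (45, 1000), "low_resource": (110, 1000)}

rates = {tier: k / n for tier, (k, n) in tiers.items()}
# Degradation as defined here: the unsafe-rate increase from English to low-resource.
degradation = rates["low_resource"] - rates["english"]

for tier, (k, n) in tiers.items():
    lo, hi = wilson_ci(k, n)
    print(f"{tier}: {k / n:.1%} unsafe (95% CI {lo:.1%}-{hi:.1%})")
print(f"degradation EN -> low-resource: {degradation:+.1%}")
```

Note that with these hypothetical counts the degradation is +9.8 points even though the low-resource rate alone, not the gap, is what matters for deployment risk, matching the caveat above.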
Want to see how your model performs?
Work with Welo Data to benchmark your model across the languages and tasks that matter to your users.
Reasoning Leaderboard
| Model | Accuracy & Confidence Interval |
|---|---|
| gpt-5 | 69.8%±0.7% |
| gpt-5.4 | 69.3%±0.7% |
| o1-preview-2024-09-12 | 68.0%±0.7% |
| gpt-4.5-preview | 67.6%±0.7% |
| gemini-3.1-pro-preview | 66.4%±0.7% |
| gemini-2.5-pro-preview-05-06 | 66.3%±0.7% |
| claude-sonnet-4-6 | 65.7%±0.7% |
| gpt-4-0125-preview | 65.7%±0.7% |
| gpt-4o | 63.8%±0.8% |
| mistral-large@2407 | 62.8%±0.8% |
| mistral-large-2411@001 | 62.7%±0.8% |
| grok-4-1-fast-reasoning | 62.5%±0.8% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 61.9%±0.8% |
| deepseek-r1 | 60.5%±0.8% |
| grok-4-0709 | 60.5%±0.8% |
| o1-mini | 60.3%±0.8% |
| gpt-4o-mini | 59.2%±0.8% |
| palmyra-x-004 | 59.2%±0.8% |
| grok-beta | 59.2%±0.8% |
| claude-3-7-sonnet@20250219 | 59.0%±0.8% |
| phi-4 | 58.8%±0.8% |
| amazon.nova-pro-v1:0 | 58.7%±0.8% |
| llama-4-maverick-17b-128e-instruct-maas | 58.2%±0.8% |
| claude-3-5-sonnet-v2@20241022 | 57.8%±0.8% |
| llama-3.1-405b-instruct-maas | 57.6%±0.8% |
| palmyra-x-003-instruct | 57.1%±0.8% |
| claude-3-opus@20240229 | 56.7%±0.8% |
| gemini-1.5-pro | 56.4%±0.8% |
| palmyra-fin | 56.3%±0.8% |
| mistral-small-2503@001 | 56.2%±0.8% |
| gemini-2.0-flash-exp | 56.0%±0.8% |
| llama-3.1-70b-instruct-maas | 55.9%±0.8% |
| llama-3.2-90b-vision-instruct-maas | 55.7%±0.8% |
| gpt-4 | 52.6%±0.8% |
| claude-3-5-haiku@20241022 | 52.4%±0.8% |
| gemini-1.5-flash | 51.3%±0.8% |
| Qwen2.5-72b-instruct | 50.9%±0.8% |
| claude-3-sonnet@20240229 | 50.4%±0.8% |
| amazon.nova-lite-v1:0 | 49.7%±0.8% |
| mistral-nemo-2407 | 47.5%±0.8% |
| claude-3-haiku@20240307 | 47.4%±0.8% |
| gemini-1.0-pro | 45.9%±0.8% |
| gpt-3.5-turbo-0125 | 42.9%±0.8% |
| mistral-large-2402-v1 | 40.5%±0.8% |
| jamba-1.5-large@001 | 40.1%±0.8% |
| llama-3.1-8b-instruct-maas | 39.2%±0.8% |
| cohere.command-r-plus-v1:0 | 36.1%±0.8% |
| gpt-3.5-turbo-instruct | 31.1%±0.7% |
| phi-3.5-mini-instruct | 27.3%±0.7% |
| palmyra-med | 23.5%±0.7% |
| codellama-34b-instruct-hf | 18.0%±0.6% |
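The reported ± values behave like binomial confidence intervals on accuracy. As a rough sanity check, assuming a normal-approximation 95% CI (an assumption; the source does not state its method or sample size), one can back out the number of scored items a given half-width implies:

```python
import math

def accuracy_ci_halfwidth(p, n, z=1.96):
    """Half-width of a normal-approximation 95% CI for accuracy p over n items."""
    return z * math.sqrt(p * (1 - p) / n)

def implied_n(p, halfwidth, z=1.96):
    """Invert the formula above: sample size implied by a reported half-width."""
    return z**2 * p * (1 - p) / halfwidth**2

# The top row reads 69.8% ± 0.7%; under the normal approximation that
# half-width corresponds to roughly this many scored items (an inference,
# not a figure reported by the source):
n = implied_n(0.698, 0.007)
print(round(n))
```

This is why the per-language tables further down show wider intervals (±2% or so): each slice contains far fewer items than the overall benchmark.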
| Model | Accuracy & CI |
|---|---|
| gpt-4.5-preview | 68.4%±2.1% |
| gpt-5 | 65.9%±2.1% |
| gemini-3.1-pro-preview | 65.8%±2.1% |
| o1-preview-2024-09-12 | 64.9%±2.1% |
| gpt-5.4 | 64.4%±2.1% |
| gpt-4-0125-preview | 63.6%±2.1% |
| gpt-4o | 63.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 62.8%±2.1% |
| claude-sonnet-4-6 | 60.3%±2.2% |
| grok-4-1-fast-reasoning | 60.3%±2.2% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 59.8%±2.2% |
| codellama-34b-instruct-hf | 0.2%±0.2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 73.4%±2% |
| gpt-5.4 | 71.9%±2% |
| o1-preview-2024-09-12 | 70.0%±2% |
| claude-sonnet-4-6 | 68.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 68.1%±2.1% |
| mistral-large@2407 | 67.5%±2.1% |
| gemini-3.1-pro-preview | 67.4%±2.1% |
| gpt-4.5-preview | 67.4%±2.1% |
| gpt-4-0125-preview | 66.2%±2.1% |
| mistral-large-2411@001 | 65.9%±2.1% |
| codellama-34b-instruct-hf | 11.7%±1.4% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 73.5%±2% |
| gpt-5 | 71.5%±2% |
| o1-preview-2024-09-12 | 71.0%±2% |
| gpt-4.5-preview | 70.9%±2% |
| claude-sonnet-4-6 | 67.1%±2.1% |
| gpt-4o | 66.6%±2.1% |
| gpt-4-0125-preview | 66.6%±2.1% |
| grok-4-1-fast-reasoning | 65.3%±2.1% |
| phi-4 | 65.1%±2.1% |
| palmyra-x-004 | 65.1%±2.1% |
| palmyra-med | 30.5%±2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 72.5%±2% |
| gpt-4.5-preview | 71.6%±2% |
| o1-preview-2024-09-12 | 71.5%±2% |
| gpt-5.4 | 70.5%±2% |
| claude-sonnet-4-6 | 70.0%±2% |
| gemini-3.1-pro-preview | 68.7%±2.1% |
| gpt-4-0125-preview | 68.2%±2.1% |
| gpt-4o | 67.5%±2.1% |
| grok-4-1-fast-reasoning | 67.3%±2.1% |
| gemini-2.5-pro-preview-05-06 | 67.2%±2.1% |
| jamba-1.5-large@001 | 21.9%±1.8% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 70.6%±2% |
| gpt-5.4 | 68.8%±2% |
| gpt-4-0125-preview | 68.3%±2.1% |
| gpt-4.5-preview | 67.8%±2.1% |
| mistral-large@2407 | 66.6%±2.1% |
| gpt-4o | 66.5%±2.1% |
| mistral-large-2411@001 | 65.7%±2.1% |
| o1-preview-2024-09-12 | 65.6%±2.1% |
| claude-sonnet-4-6 | 64.2%±2.1% |
| gemini-3.1-pro-preview | 64.0%±2.1% |
| palmyra-med | 28.8%±2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 68.9%±2% |
| gemini-2.5-pro-preview-05-06 | 68.4%±2.1% |
| gemini-3.1-pro-preview | 67.8%±2.1% |
| gpt-5 | 67.0%±2.1% |
| o1-preview-2024-09-12 | 66.9%±2.1% |
| claude-sonnet-4-6 | 64.2%±2.1% |
| gpt-4.5-preview | 63.8%±2.1% |
| mistral-large@2407 | 63.5%±2.1% |
| mistral-large-2411@001 | 63.5%±2.1% |
| gpt-4-0125-preview | 63.3%±2.1% |
| codellama-34b-instruct-hf | 13.3%±1.5% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 69.2%±2% |
| gpt-5 | 68.3%±2.1% |
| gemini-2.5-pro-preview-05-06 | 68.3%±2.1% |
| gemini-3.1-pro-preview | 67.4%±2.1% |
| claude-sonnet-4-6 | 66.3%±2.1% |
| o1-preview-2024-09-12 | 65.3%±2.1% |
| gpt-4-0125-preview | 64.5%±2.1% |
| gpt-4.5-preview | 62.7%±2.1% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 61.4%±2.2% |
| mistral-large@2407 | 60.5%±2.2% |
| codellama-34b-instruct-hf | 3.2%±0.8% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 69.3%±2% |
| o1-preview-2024-09-12 | 69.0%±2% |
| gpt-4.5-preview | 68.0%±2.1% |
| gpt-5.4 | 67.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 67.4%±2.1% |
| gemini-3.1-pro-preview | 67.1%±2.1% |
| claude-sonnet-4-6 | 65.3%±2.1% |
| gpt-4-0125-preview | 64.8%±2.1% |
| grok-4-1-fast-reasoning | 62.4%±2.1% |
| gpt-4o | 61.5%±2.2% |
| codellama-34b-instruct-hf | 7.6%±1.2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 83.4%±1.3% |
| gemini-3.1-pro-preview | 83.1%±1.3% |
| claude-sonnet-4-6 | 83.1%±1.3% |
| gpt-5 | 82.0%±1.3% |
| grok-4-1-fast-reasoning | 81.4%±1.4% |
| gemini-2.5-pro-preview-05-06 | 80.9%±1.4% |
| gpt-4-0125-preview | 80.3%±1.4% |
| o1-preview-2024-09-12 | 79.8%±1.4% |
| gpt-4.5-preview | 78.4%±1.4% |
| grok-4-0709 | 77.6%±1.5% |
| codellama-34b-instruct-hf | 17.6%±1.3% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 76.5%±1.2% |
| o1-preview-2024-09-12 | 72.3%±1.3% |
| gemini-2.5-pro-preview-05-06 | 71.8%±1.3% |
| gemini-3.1-pro-preview | 70.9%±1.3% |
| gpt-4-0125-preview | 70.9%±1.3% |
| gpt-5.4 | 70.7%±1.3% |
| gpt-4.5-preview | 70.2%±1.3% |
| deepseek-r1 | 67.7%±1.3% |
| o1-mini | 67.3%±1.3% |
| claude-sonnet-4-6 | 67.1%±1.3% |
| codellama-34b-instruct-hf | 16.9%±1.1% |
| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 76.0%±1.3% |
| gpt-4.5-preview | 75.0%±1.3% |
| o1-preview-2024-09-12 | 73.2%±1.4% |
| gpt-4-0125-preview | 71.9%±1.4% |
| gemini-3.1-pro-preview | 71.7%±1.4% |
| claude-sonnet-4-6 | 71.2%±1.4% |
| gpt-5 | 70.5%±1.4% |
| gpt-4o | 69.9%±1.4% |
| gemini-2.5-pro-preview-05-06 | 69.7%±1.4% |
| mistral-large-2411@001 | 69.4%±1.4% |
| phi-3.5-mini-instruct | 19.2%±1.2% |
| Model | Accuracy & CI |
|---|---|
| gpt-5 | 50.3%±1.6% |
| gpt-5.4 | 48.7%±1.6% |
| gpt-4.5-preview | 47.3%±1.6% |
| o1-preview-2024-09-12 | 47.2%±1.6% |
| gpt-4o | 45.0%±1.6% |
| claude-sonnet-4-6 | 43.7%±1.6% |
| gemini-2.5-pro-preview-05-06 | 43.6%±1.6% |
| grok-beta | 43.0%±1.6% |
| mistral-large@2407 | 42.6%±1.6% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 42.3%±1.6% |
| codellama-34b-instruct-hf | 15.9%±1.2% |
Most available public and private reasoning benchmarks are relatively simple and fail to fully evaluate causal capabilities. By contrast, Welo Data's benchmarks include a novel story that provides context to the model, followed by a variety of questions that test multiple causal relationships.
Event A: The sun came out.
Event B: John put his sunglasses on.
Question: Which event caused the other?
Option 1: Event A caused event B.
Option 2: Event B caused event A.
Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.
Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…
Question: What caused Yuki to be put on probation?
Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.
Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…
Question: What caused Yuki to be put on probation?
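The item format illustrated above can be sketched as a minimal data structure and grader. The class and field names below are illustrative assumptions, not Welo Data's actual schema; benchmark accuracy is simply the mean of `grade` over all items:

```python
from dataclasses import dataclass

@dataclass
class CausalItem:
    """One multiple-choice causal-reasoning item: a story, a question, and options."""
    story: str
    question: str
    options: list[str]
    answer_index: int  # index of the correct option

# The two-event toy example above, encoded as a single item.
item = CausalItem(
    story="Event A: The sun came out. Event B: John put his sunglasses on.",
    question="Which event caused the other?",
    options=["Event A caused event B.", "Event B caused event A."],
    answer_index=0,
)

def grade(item: CausalItem, model_choice: int) -> bool:
    """Score one model response against the keyed answer."""
    return model_choice == item.answer_index

print(grade(item, 0))  # True: the sun coming out caused John to put sunglasses on
```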
Want to see how your model performs?
Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.
Contact Us →