Evals by Welo Data
Previously known as Model Assessment Suite
Evals by Welo Data measures how frontier and open-source LLMs perform across languages, because the real world is multilingual and model performance varies significantly by language and task. Our evaluations span 8 languages (Arabic, German, English, Spanish, French, Japanese, Korean, and Turkish) and 4 domains (Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General).
Our evaluations focus on multilingual reasoning, which remains a difficult challenge despite significant advances in LLMs. We have evaluated 51 models using human-validated, natively authored scenarios and evaluation items designed to surface genuine reasoning capabilities rather than pattern-matching on previously seen data.
| Model | Accuracy & Confidence Interval |
|---|---|
| gpt-5 | 69.8%±0.7% |
| gpt-5.4 | 69.3%±0.7% |
| o1-preview-2024-09-12 | 68.0%±0.7% |
| gpt-4.5-preview | 67.6%±0.7% |
| gemini-3.1-pro-preview | 66.4%±0.7% |
| gemini-2.5-pro-preview-05-06 | 66.3%±0.7% |
| claude-sonnet-4-6 | 65.7%±0.7% |
| gpt-4-0125-preview | 65.7%±0.7% |
| gpt-4o | 63.8%±0.8% |
| mistral-large@2407 | 62.8%±0.8% |
| mistral-large-2411@001 | 62.7%±0.8% |
| grok-4-1-fast-reasoning | 62.5%±0.8% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 61.9%±0.8% |
| deepseek-r1 | 60.5%±0.8% |
| grok-4-0709 | 60.5%±0.8% |
| o1-mini | 60.3%±0.8% |
| gpt-4o-mini | 59.2%±0.8% |
| palmyra-x-004 | 59.2%±0.8% |
| grok-beta | 59.2%±0.8% |
| claude-3-7-sonnet@20250219 | 59.0%±0.8% |
| phi-4 | 58.8%±0.8% |
| amazon.nova-pro-v1:0 | 58.7%±0.8% |
| llama-4-maverick-17b-128e-instruct-maas | 58.2%±0.8% |
| claude-3-5-sonnet-v2@20241022 | 57.8%±0.8% |
| llama-3.1-405b-instruct-maas | 57.6%±0.8% |
| palmyra-x-003-instruct | 57.1%±0.8% |
| claude-3-opus@20240229 | 56.7%±0.8% |
| gemini-1.5-pro | 56.4%±0.8% |
| palmyra-fin | 56.3%±0.8% |
| mistral-small-2503@001 | 56.2%±0.8% |
| gemini-2.0-flash-exp | 56.0%±0.8% |
| llama-3.1-70b-instruct-maas | 55.9%±0.8% |
| llama-3.2-90b-vision-instruct-maas | 55.7%±0.8% |
| gpt-4 | 52.6%±0.8% |
| claude-3-5-haiku@20241022 | 52.4%±0.8% |
| gemini-1.5-flash | 51.3%±0.8% |
| Qwen2.5-72b-instruct | 50.9%±0.8% |
| claude-3-sonnet@20240229 | 50.4%±0.8% |
| amazon.nova-lite-v1:0 | 49.7%±0.8% |
| mistral-nemo-2407 | 47.5%±0.8% |
| claude-3-haiku@20240307 | 47.4%±0.8% |
| gemini-1.0-pro | 45.9%±0.8% |
| gpt-3.5-turbo-0125 | 42.9%±0.8% |
| mistral-large-2402-v1 | 40.5%±0.8% |
| jamba-1.5-large@001 | 40.1%±0.8% |
| llama-3.1-8b-instruct-maas | 39.2%±0.8% |
| cohere.command-r-plus-v1:0 | 36.1%±0.8% |
| gpt-3.5-turbo-instruct | 31.1%±0.7% |
| phi-3.5-mini-instruct | 27.3%±0.7% |
| palmyra-med | 23.5%±0.7% |
| codellama-34b-instruct-hf | 18.0%±0.6% |
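The margins in the table above are consistent in scale with a normal-approximation (Wald) binomial confidence interval over the evaluation items. The exact interval method and item counts are not stated here, so the sketch below, with a hypothetical item count `n`, is an illustration only:

```python
import math

def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) binomial confidence
    interval: p is the observed accuracy (proportion correct), n is the
    number of evaluation items, and z defaults to 1.96 for ~95% coverage.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# An accuracy of 69.8% over a hypothetical 16,000 items gives roughly a
# 0.7-percentage-point margin, matching the scale of the margins above.
print(round(100 * wald_half_width(0.698, 16_000), 2))  # → 0.71
```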

| Model | Accuracy & CI |
|---|---|
| gpt-4.5-preview | 68.4%±2.1% |
| gpt-5 | 65.9%±2.1% |
| gemini-3.1-pro-preview | 65.8%±2.1% |
| o1-preview-2024-09-12 | 64.9%±2.1% |
| gpt-5.4 | 64.4%±2.1% |
| gpt-4-0125-preview | 63.6%±2.1% |
| gpt-4o | 63.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 62.8%±2.1% |
| claude-sonnet-4-6 | 60.3%±2.2% |
| grok-4-1-fast-reasoning | 60.3%±2.2% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 59.8%±2.2% |
| codellama-34b-instruct-hf | 0.2%±0.2% |

| Model | Accuracy & CI |
|---|---|
| gpt-5 | 73.4%±2.0% |
| gpt-5.4 | 71.9%±2.0% |
| o1-preview-2024-09-12 | 70.0%±2.0% |
| claude-sonnet-4-6 | 68.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 68.1%±2.1% |
| mistral-large@2407 | 67.5%±2.1% |
| gemini-3.1-pro-preview | 67.4%±2.1% |
| gpt-4.5-preview | 67.4%±2.1% |
| gpt-4-0125-preview | 66.2%±2.1% |
| mistral-large-2411@001 | 65.9%±2.1% |
| codellama-34b-instruct-hf | 11.7%±1.4% |

| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 73.5%±2.0% |
| gpt-5 | 71.5%±2.0% |
| o1-preview-2024-09-12 | 71.0%±2.0% |
| gpt-4.5-preview | 70.9%±2.0% |
| claude-sonnet-4-6 | 67.1%±2.1% |
| gpt-4o | 66.6%±2.1% |
| gpt-4-0125-preview | 66.6%±2.1% |
| grok-4-1-fast-reasoning | 65.3%±2.1% |
| phi-4 | 65.1%±2.1% |
| palmyra-x-004 | 65.1%±2.1% |
| palmyra-med | 30.5%±2.0% |

| Model | Accuracy & CI |
|---|---|
| gpt-5 | 72.5%±2.0% |
| gpt-4.5-preview | 71.6%±2.0% |
| o1-preview-2024-09-12 | 71.5%±2.0% |
| gpt-5.4 | 70.5%±2.0% |
| claude-sonnet-4-6 | 70.0%±2.0% |
| gemini-3.1-pro-preview | 68.7%±2.1% |
| gpt-4-0125-preview | 68.2%±2.1% |
| gpt-4o | 67.5%±2.1% |
| grok-4-1-fast-reasoning | 67.3%±2.1% |
| gemini-2.5-pro-preview-05-06 | 67.2%±2.1% |
| jamba-1.5-large@001 | 21.9%±1.8% |

| Model | Accuracy & CI |
|---|---|
| gpt-5 | 70.6%±2.0% |
| gpt-5.4 | 68.8%±2.0% |
| gpt-4-0125-preview | 68.3%±2.1% |
| gpt-4.5-preview | 67.8%±2.1% |
| mistral-large@2407 | 66.6%±2.1% |
| gpt-4o | 66.5%±2.1% |
| mistral-large-2411@001 | 65.7%±2.1% |
| o1-preview-2024-09-12 | 65.6%±2.1% |
| claude-sonnet-4-6 | 64.2%±2.1% |
| gemini-3.1-pro-preview | 64.0%±2.1% |
| palmyra-med | 28.8%±2.0% |

| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 68.9%±2.0% |
| gemini-2.5-pro-preview-05-06 | 68.4%±2.1% |
| gemini-3.1-pro-preview | 67.8%±2.1% |
| gpt-5 | 67.0%±2.1% |
| o1-preview-2024-09-12 | 66.9%±2.1% |
| claude-sonnet-4-6 | 64.2%±2.1% |
| gpt-4.5-preview | 63.8%±2.1% |
| mistral-large@2407 | 63.5%±2.1% |
| mistral-large-2411@001 | 63.5%±2.1% |
| gpt-4-0125-preview | 63.3%±2.1% |
| codellama-34b-instruct-hf | 13.3%±1.5% |

| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 69.2%±2.0% |
| gpt-5 | 68.3%±2.1% |
| gemini-2.5-pro-preview-05-06 | 68.3%±2.1% |
| gemini-3.1-pro-preview | 67.4%±2.1% |
| claude-sonnet-4-6 | 66.3%±2.1% |
| o1-preview-2024-09-12 | 65.3%±2.1% |
| gpt-4-0125-preview | 64.5%±2.1% |
| gpt-4.5-preview | 62.7%±2.1% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 61.4%±2.2% |
| mistral-large@2407 | 60.5%±2.2% |
| codellama-34b-instruct-hf | 3.2%±0.8% |

| Model | Accuracy & CI |
|---|---|
| gpt-5 | 69.3%±2.0% |
| o1-preview-2024-09-12 | 69.0%±2.0% |
| gpt-4.5-preview | 68.0%±2.1% |
| gpt-5.4 | 67.5%±2.1% |
| gemini-2.5-pro-preview-05-06 | 67.4%±2.1% |
| gemini-3.1-pro-preview | 67.1%±2.1% |
| claude-sonnet-4-6 | 65.3%±2.1% |
| gpt-4-0125-preview | 64.8%±2.1% |
| grok-4-1-fast-reasoning | 62.4%±2.1% |
| gpt-4o | 61.5%±2.2% |
| codellama-34b-instruct-hf | 7.6%±1.2% |

| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 83.4%±1.3% |
| gemini-3.1-pro-preview | 83.1%±1.3% |
| claude-sonnet-4-6 | 83.1%±1.3% |
| gpt-5 | 82.0%±1.3% |
| grok-4-1-fast-reasoning | 81.4%±1.4% |
| gemini-2.5-pro-preview-05-06 | 80.9%±1.4% |
| gpt-4-0125-preview | 80.3%±1.4% |
| o1-preview-2024-09-12 | 79.8%±1.4% |
| gpt-4.5-preview | 78.4%±1.4% |
| grok-4-0709 | 77.6%±1.5% |
| codellama-34b-instruct-hf | 17.6%±1.3% |

| Model | Accuracy & CI |
|---|---|
| gpt-5 | 76.5%±1.2% |
| o1-preview-2024-09-12 | 72.3%±1.3% |
| gemini-2.5-pro-preview-05-06 | 71.8%±1.3% |
| gemini-3.1-pro-preview | 70.9%±1.3% |
| gpt-4-0125-preview | 70.9%±1.3% |
| gpt-5.4 | 70.7%±1.3% |
| gpt-4.5-preview | 70.2%±1.3% |
| deepseek-r1 | 67.7%±1.3% |
| o1-mini | 67.3%±1.3% |
| claude-sonnet-4-6 | 67.1%±1.3% |
| codellama-34b-instruct-hf | 16.9%±1.1% |

| Model | Accuracy & CI |
|---|---|
| gpt-5.4 | 76.0%±1.3% |
| gpt-4.5-preview | 75.0%±1.3% |
| o1-preview-2024-09-12 | 73.2%±1.4% |
| gpt-4-0125-preview | 71.9%±1.4% |
| gemini-3.1-pro-preview | 71.7%±1.4% |
| claude-sonnet-4-6 | 71.2%±1.4% |
| gpt-5 | 70.5%±1.4% |
| gpt-4o | 69.9%±1.4% |
| gemini-2.5-pro-preview-05-06 | 69.7%±1.4% |
| mistral-large-2411@001 | 69.4%±1.4% |
| phi-3.5-mini-instruct | 19.2%±1.2% |

| Model | Accuracy & CI |
|---|---|
| gpt-5 | 50.3%±1.6% |
| gpt-5.4 | 48.7%±1.6% |
| gpt-4.5-preview | 47.3%±1.6% |
| o1-preview-2024-09-12 | 47.2%±1.6% |
| gpt-4o | 45.0%±1.6% |
| claude-sonnet-4-6 | 43.7%±1.6% |
| gemini-2.5-pro-preview-05-06 | 43.6%±1.6% |
| grok-beta | 43.0%±1.6% |
| mistral-large@2407 | 42.6%±1.6% |
| qwen.qwen3-235b-a22b-2507-v1:0 | 42.3%±1.6% |
| codellama-34b-instruct-hf | 15.9%±1.2% |
Available public and private reasoning benchmarks are relatively simple and fail to fully evaluate causal reasoning capabilities. By contrast, Welo Data's benchmarks present the model with a novel story for context, followed by a variety of questions that test multiple causal relationships.
Event A: The sun came out.
Event B: John put his sunglasses on.
Question: Which event caused the other?
Option 1: Event A caused event B.
Option 2: Event B caused event A.
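Scoring such a two-option item reduces to comparing the model's chosen option against a gold label. A minimal sketch of one way to represent and score an item (the field and function names here are illustrative, not Welo Data's actual harness):

```python
from dataclasses import dataclass

@dataclass
class CausalItem:
    events: tuple[str, str]   # (Event A, Event B) as shown to the model
    options: tuple[str, str]  # the answer options
    gold: int                 # index of the correct option

item = CausalItem(
    events=("The sun came out.", "John put his sunglasses on."),
    options=("Event A caused event B.", "Event B caused event A."),
    gold=0,  # the sun coming out caused John to put his sunglasses on
)

def score(item: CausalItem, chosen: int) -> bool:
    """True if the model picked the gold option."""
    return chosen == item.gold

print(score(item, 0))  # True
```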
Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.
Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…
Question: What caused Yuki to be put on probation?
Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.
Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…
Question: What caused Yuki to be put on probation?
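The two examples above follow a fixed template: an instruction, a story that supplies the causal context, and a question. Assembling such a prompt can be sketched as follows (the function name and exact template layout are assumptions for illustration):

```python
INSTRUCTION = (
    "You are a helpful assistant for causal relationship understanding. "
    "Review the following story and think about the cause-and-effect "
    "relationships. Then, answer the question that follows the story."
)

def build_prompt(story: str, question: str) -> str:
    """Assemble one causal-reasoning evaluation prompt from its parts."""
    return f"{INSTRUCTION}\n\nStory: {story}\n\nQuestion: {question}"

prompt = build_prompt(
    story="The Massachusetts Department of Health requires ...",
    question="What caused Yuki to be put on probation?",
)
```

Holding the instruction and question fixed while varying the story, as in the two examples above, isolates how well the model tracks the causal chain in the context rather than the wording of the question.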
Want to see how your model performs?
Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.
Contact Us →