Evals by Welo Data

Previously known as Model Assessment Suite

Evals by Welo Data measures how frontier and open-source LLMs perform across languages — because the real world is multilingual, and model performance varies significantly across languages and tasks. Our evaluations span 8 languages (Arabic, German, English, Spanish, French, Japanese, Korean, and Turkish) and 4 domains (Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General).

Our evaluations focus on multilingual reasoning. Despite significant advances in LLMs, multilingual reasoning remains a difficult challenge. We've evaluated 51 models using human-validated, natively-authored scenarios and evaluation items designed to surface genuine reasoning capabilities — not pattern-matching on seen data.

8 Languages · 4 Domains
AR · DE · EN · ES · FR · JA · KO · TR
Evaluation Domain: Multilingual Reasoning
See how your model performs →
Overall Rankings
Overall Leaderboard
Model · Accuracy & Confidence Interval
gpt-5 · 69.8% ±0.7%
gpt-5.4 · 69.3% ±0.7%
o1-preview-2024-09-12 · 68.0% ±0.7%
gpt-4.5-preview · 67.6% ±0.7%
gemini-3.1-pro-preview · 66.4% ±0.7%
gemini-2.5-pro-preview-05-06 · 66.3% ±0.7%
claude-sonnet-4-6 · 65.7% ±0.7%
gpt-4-0125-preview · 65.7% ±0.7%
gpt-4o · 63.8% ±0.8%
mistral-large@2407 · 62.8% ±0.8%
mistral-large-2411@001 · 62.7% ±0.8%
grok-4-1-fast-reasoning · 62.5% ±0.8%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.9% ±0.8%
deepseek-r1 · 60.5% ±0.8%
grok-4-0709 · 60.5% ±0.8%
o1-mini · 60.3% ±0.8%
gpt-4o-mini · 59.2% ±0.8%
palmyra-x-004 · 59.2% ±0.8%
grok-beta · 59.2% ±0.8%
claude-3-7-sonnet@20250219 · 59.0% ±0.8%
phi-4 · 58.8% ±0.8%
amazon.nova-pro-v1:0 · 58.7% ±0.8%
llama-4-maverick-17b-128e-instruct-maas · 58.2% ±0.8%
claude-3-5-sonnet-v2@20241022 · 57.8% ±0.8%
llama-3.1-405b-instruct-maas · 57.6% ±0.8%
palmyra-x-003-instruct · 57.1% ±0.8%
claude-3-opus@20240229 · 56.7% ±0.8%
gemini-1.5-pro · 56.4% ±0.8%
palmyra-fin · 56.3% ±0.8%
mistral-small-2503@001 · 56.2% ±0.8%
gemini-2.0-flash-exp · 56.0% ±0.8%
llama-3.1-70b-instruct-maas · 55.9% ±0.8%
llama-3.2-90b-vision-instruct-maas · 55.7% ±0.8%
gpt-4 · 52.6% ±0.8%
claude-3-5-haiku@20241022 · 52.4% ±0.8%
gemini-1.5-flash · 51.3% ±0.8%
Qwen2.5-72b-instruct · 50.9% ±0.8%
claude-3-sonnet@20240229 · 50.4% ±0.8%
amazon.nova-lite-v1:0 · 49.7% ±0.8%
mistral-nemo-2407 · 47.5% ±0.8%
claude-3-haiku@20240307 · 47.4% ±0.8%
gemini-1.0-pro · 45.9% ±0.8%
gpt-3.5-turbo-0125 · 42.9% ±0.8%
mistral-large-2402-v1 · 40.5% ±0.8%
jamba-1.5-large@001 · 40.1% ±0.8%
llama-3.1-8b-instruct-maas · 39.2% ±0.8%
cohere.command-r-plus-v1:0 · 36.1% ±0.8%
gpt-3.5-turbo-instruct · 31.1% ±0.7%
phi-3.5-mini-instruct · 27.3% ±0.7%
palmyra-med · 23.5% ±0.7%
codellama-34b-instruct-hf · 18.0% ±0.6%
By Language
Language Leaderboards
Model · Accuracy & CI
gpt-4.5-preview · 68.4% ±2.1%
gpt-5 · 65.9% ±2.1%
gemini-3.1-pro-preview · 65.8% ±2.1%
o1-preview-2024-09-12 · 64.9% ±2.1%
gpt-5.4 · 64.4% ±2.1%
gpt-4-0125-preview · 63.6% ±2.1%
gpt-4o · 63.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 62.8% ±2.1%
claude-sonnet-4-6 · 60.3% ±2.2%
grok-4-1-fast-reasoning · 60.3% ±2.2%
qwen.qwen3-235b-a22b-2507-v1:0 · 59.8% ±2.2%
codellama-34b-instruct-hf · 0.2% ±0.2%
Model · Accuracy & CI
gpt-5 · 73.4% ±2.0%
gpt-5.4 · 71.9% ±2.0%
o1-preview-2024-09-12 · 70.0% ±2.0%
claude-sonnet-4-6 · 68.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.1% ±2.1%
mistral-large@2407 · 67.5% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
gpt-4.5-preview · 67.4% ±2.1%
gpt-4-0125-preview · 66.2% ±2.1%
mistral-large-2411@001 · 65.9% ±2.1%
codellama-34b-instruct-hf · 11.7% ±1.4%
Model · Accuracy & CI
gpt-5.4 · 73.5% ±2.0%
gpt-5 · 71.5% ±2.0%
o1-preview-2024-09-12 · 71.0% ±2.0%
gpt-4.5-preview · 70.9% ±2.0%
claude-sonnet-4-6 · 67.1% ±2.1%
gpt-4o · 66.6% ±2.1%
gpt-4-0125-preview · 66.6% ±2.1%
grok-4-1-fast-reasoning · 65.3% ±2.1%
phi-4 · 65.1% ±2.1%
palmyra-x-004 · 65.1% ±2.1%
palmyra-med · 30.5% ±2.0%
Model · Accuracy & CI
gpt-5 · 72.5% ±2.0%
gpt-4.5-preview · 71.6% ±2.0%
o1-preview-2024-09-12 · 71.5% ±2.0%
gpt-5.4 · 70.5% ±2.0%
claude-sonnet-4-6 · 70.0% ±2.0%
gemini-3.1-pro-preview · 68.7% ±2.1%
gpt-4-0125-preview · 68.2% ±2.1%
gpt-4o · 67.5% ±2.1%
grok-4-1-fast-reasoning · 67.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.2% ±2.1%
jamba-1.5-large@001 · 21.9% ±1.8%
Model · Accuracy & CI
gpt-5 · 70.6% ±2.0%
gpt-5.4 · 68.8% ±2.0%
gpt-4-0125-preview · 68.3% ±2.1%
gpt-4.5-preview · 67.8% ±2.1%
mistral-large@2407 · 66.6% ±2.1%
gpt-4o · 66.5% ±2.1%
mistral-large-2411@001 · 65.7% ±2.1%
o1-preview-2024-09-12 · 65.6% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gemini-3.1-pro-preview · 64.0% ±2.1%
palmyra-med · 28.8% ±2.0%
Model · Accuracy & CI
gpt-5.4 · 68.9% ±2.0%
gemini-2.5-pro-preview-05-06 · 68.4% ±2.1%
gemini-3.1-pro-preview · 67.8% ±2.1%
gpt-5 · 67.0% ±2.1%
o1-preview-2024-09-12 · 66.9% ±2.1%
claude-sonnet-4-6 · 64.2% ±2.1%
gpt-4.5-preview · 63.8% ±2.1%
mistral-large@2407 · 63.5% ±2.1%
mistral-large-2411@001 · 63.5% ±2.1%
gpt-4-0125-preview · 63.3% ±2.1%
codellama-34b-instruct-hf · 13.3% ±1.5%
Model · Accuracy & CI
gpt-5.4 · 69.2% ±2.0%
gpt-5 · 68.3% ±2.1%
gemini-2.5-pro-preview-05-06 · 68.3% ±2.1%
gemini-3.1-pro-preview · 67.4% ±2.1%
claude-sonnet-4-6 · 66.3% ±2.1%
o1-preview-2024-09-12 · 65.3% ±2.1%
gpt-4-0125-preview · 64.5% ±2.1%
gpt-4.5-preview · 62.7% ±2.1%
qwen.qwen3-235b-a22b-2507-v1:0 · 61.4% ±2.2%
mistral-large@2407 · 60.5% ±2.2%
codellama-34b-instruct-hf · 3.2% ±0.8%
Model · Accuracy & CI
gpt-5 · 69.3% ±2.0%
o1-preview-2024-09-12 · 69.0% ±2.0%
gpt-4.5-preview · 68.0% ±2.1%
gpt-5.4 · 67.5% ±2.1%
gemini-2.5-pro-preview-05-06 · 67.4% ±2.1%
gemini-3.1-pro-preview · 67.1% ±2.1%
claude-sonnet-4-6 · 65.3% ±2.1%
gpt-4-0125-preview · 64.8% ±2.1%
grok-4-1-fast-reasoning · 62.4% ±2.1%
gpt-4o · 61.5% ±2.2%
codellama-34b-instruct-hf · 7.6% ±1.2%
By Domain
Domain Leaderboards
Model · Accuracy & CI
gpt-5.4 · 83.4% ±1.3%
gemini-3.1-pro-preview · 83.1% ±1.3%
claude-sonnet-4-6 · 83.1% ±1.3%
gpt-5 · 82.0% ±1.3%
grok-4-1-fast-reasoning · 81.4% ±1.4%
gemini-2.5-pro-preview-05-06 · 80.9% ±1.4%
gpt-4-0125-preview · 80.3% ±1.4%
o1-preview-2024-09-12 · 79.8% ±1.4%
gpt-4.5-preview · 78.4% ±1.4%
grok-4-0709 · 77.6% ±1.5%
codellama-34b-instruct-hf · 17.6% ±1.3%
Model · Accuracy & CI
gpt-5 · 76.5% ±1.2%
o1-preview-2024-09-12 · 72.3% ±1.3%
gemini-2.5-pro-preview-05-06 · 71.8% ±1.3%
gemini-3.1-pro-preview · 70.9% ±1.3%
gpt-4-0125-preview · 70.9% ±1.3%
gpt-5.4 · 70.7% ±1.3%
gpt-4.5-preview · 70.2% ±1.3%
deepseek-r1 · 67.7% ±1.3%
o1-mini · 67.3% ±1.3%
claude-sonnet-4-6 · 67.1% ±1.3%
codellama-34b-instruct-hf · 16.9% ±1.1%
Model · Accuracy & CI
gpt-5.4 · 76.0% ±1.3%
gpt-4.5-preview · 75.0% ±1.3%
o1-preview-2024-09-12 · 73.2% ±1.4%
gpt-4-0125-preview · 71.9% ±1.4%
gemini-3.1-pro-preview · 71.7% ±1.4%
claude-sonnet-4-6 · 71.2% ±1.4%
gpt-5 · 70.5% ±1.4%
gpt-4o · 69.9% ±1.4%
gemini-2.5-pro-preview-05-06 · 69.7% ±1.4%
mistral-large-2411@001 · 69.4% ±1.4%
phi-3.5-mini-instruct · 19.2% ±1.2%
How It Works
Methodology
01 — DATASET DESIGN
Novel, Human-Authored Scenarios
The dataset consists of three components: fact-based scenarios, scenario-based narratives, and question & answer pairs. Domain experts generated novel, fact-based scenarios using terminology from their respective fields. Writers then used those scenarios to produce narratives from different character perspectives. Finally, experts in Cognitive Science, Philosophy, Linguistics, and NLP research generated Q&A pairs based on the causal events in each scenario. Because the stories and questions are entirely original, models cannot rely on memorized training data — their multilingual reasoning capabilities are genuinely tested.
02 — SCENARIO TYPES
Three Causation Categories
Each domain contains six scenarios divided across three types. Standard Causation depicts clear cause-and-effect relationships without norm violations. Normality Violation — Explicit introduces scenarios where at least one explicit norm is violated (a policy, rule, law, or regulation). Normality Violation — Implicit involves violations of informal, unwritten rules such as social norms or everyday conventions. Each scenario includes 9–14 questions depending on type.
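The scenario structure described above can be sketched as a simple schema. This is an illustrative model only; the actual dataset format is not published, and all field names here are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class CausationType(Enum):
    # The three causation categories described in the methodology.
    STANDARD = "standard_causation"
    EXPLICIT_VIOLATION = "normality_violation_explicit"
    IMPLICIT_VIOLATION = "normality_violation_implicit"

@dataclass
class Question:
    text: str
    options: list[str]  # binary or multiple-choice answer options
    answer: str         # gold label

@dataclass
class Scenario:
    domain: str                    # e.g. "Legal & Criminal Justice"
    causation_type: CausationType
    narrative: str                 # story written from a character's perspective
    questions: list[Question]

    def __post_init__(self):
        # Each scenario carries 9-14 questions depending on its type.
        assert 9 <= len(self.questions) <= 14, "expected 9-14 questions"
```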
03 — QUESTION CATEGORIES
What the Evaluation Measures
Models are evaluated on their ability to identify causal relationships, distinguish a cause from a confounder, determine normality violations in a chain of causal events, and perform these tasks in the context of language variation. Question types include binary and multiple-choice formats across four categories: Causal Discovery (Cause), Causal Discovery (Confounder), Language Variation, and Normality Variation.
04 — TRANSLATION APPROACH
Cross-Lingual Consistency
Stories and questions were originally written in English and professionally translated across seven additional languages. Translators were instructed to avoid word-for-word translation and instead prioritize retaining original semantics while using natural word choices and grammatical structures appropriate to each language. This standardization enables comparison of the same model on the same scenario across languages, while preserving linguistic and cultural naturalness.
05 — DOMAINS
Four Subject Areas
The dataset covers four domains: Legal & Criminal Justice; Health, Medicine & Science; Business, Finance & Economics; and General. Each domain includes scenarios across all three causation types, ensuring that results reflect reasoning capability rather than domain familiarity alone.
06 — STATISTICS
Confidence Intervals
All accuracy scores include approximate 95% confidence intervals. The ± values shown in each leaderboard represent the margin of error around each accuracy estimate, giving a range rather than just a point estimate. Models with overlapping confidence intervals should be considered statistically tied.
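A margin of this form can be computed with the standard normal-approximation (Wald) interval for a proportion. The sketch below uses illustrative counts, not the evaluation's actual item totals, so the margin it prints will not match the leaderboard's.

```python
import math

def wald_ci_95(correct: int, total: int) -> tuple[float, float]:
    """Accuracy with an approximate 95% margin via the normal approximation."""
    p = correct / total
    margin = 1.96 * math.sqrt(p * (1 - p) / total)
    return p, margin

def statistically_tied(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two (accuracy, margin) intervals overlap."""
    return abs(a[0] - b[0]) <= a[1] + b[1]

# Illustrative counts only (not the real evaluation data):
acc, m = wald_ci_95(2793, 4000)
print(f"{acc:.1%} ±{m:.1%}")  # prints: 69.8% ±1.4%
```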
Why We Are Different
Benchmark Complexity

Available public and private reasoning benchmarks are comparatively simple and fail to fully evaluate causal capabilities. By contrast, Welo Data's benchmarks pair a novel story, which provides context to the model, with a variety of questions that test multiple causal relationships.

Competitor: Example 1

Event A: The sun came out.

Event B: John put his sunglasses on.

Question: Which event caused the other?

Option 1: Event A caused event B.

Option 2: Event B caused event A.

Evals by Welo Data Multilingual Reasoning: Example 1

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, such as refrigeration of all dairy items at all times. A late shift followed by a morning shift at Beans4All is exhausting…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Evals by Welo Data Multilingual Reasoning: Example 2

Prompt: You are a helpful assistant for causal relationship understanding. Review the following story and think about the cause-and-effect relationships. Then, answer the question that follows the story.

Story: The Massachusetts Department of Health requires all food-related establishments to adhere to specific rules to ensure food safety, something we take seriously at my job. I worked my way up to manager at Beans4All…

Question: What caused Yuki to be put on probation?

a. Sam was afraid of Fran
b. Sam was tired from a late shift
c. Sam dropped the hot beverage
d. Yuki cleaned up Sam's spill, forgetting about the oat milk
e. Alex stepped in to help at the counter after Sam went home
f. Jaime became ill at work
g. The hospital staff determined Jaime had been food-poisoned
h. The café was fined by the health department
i. There are no causal relationships.
j. There is not enough information.
Research
Related Publications
Get in touch

Want to see how your model performs?

Work with Welo Data to benchmark your model across the languages and reasoning tasks that matter to your users.

Contact Us →