RESEARCH PAPER
A Novel Framework for Testing Causal Reasoning in LLMs
Design, Data Collection, and Evaluation

This Welo Data research paper introduces a novel framework for testing multilingual causal reasoning in large language models (LLMs).
It critically examines existing benchmarks, revealing their shortcomings in complexity, originality, and linguistic diversity.
The research evaluates more than 20 models across six languages (English, Spanish, Japanese, Korean, Turkish, and Arabic), exposing inconsistencies and underperformance, particularly in lower-resource languages.
Key insights include:
- Why current causal reasoning benchmarks fall short
- How human-crafted, complex prompts improve evaluation
- Findings on multilingual model accuracy and consistency (see the sketch after this list)
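
To make the two quantities in the last bullet concrete, here is a minimal sketch of how per-language accuracy and cross-language answer consistency might be computed. This is not the paper's actual evaluation harness; the data structure, field names, and sample values are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical record: one model's answer to one causal-reasoning item,
# posed in each evaluated language. Structure and values are illustrative:
# results[item_id][language] = (predicted_answer, gold_answer)
results = {
    "item-001": {
        "en": ("B", "B"), "es": ("B", "B"), "ja": ("A", "B"),
        "ko": ("B", "B"), "tr": ("B", "B"), "ar": ("A", "B"),
    },
    # ... more items ...
}

def per_language_accuracy(results):
    """Fraction of items answered correctly, computed per language."""
    correct, total = defaultdict(int), defaultdict(int)
    for answers in results.values():
        for lang, (pred, gold) in answers.items():
            total[lang] += 1
            correct[lang] += int(pred == gold)
    return {lang: correct[lang] / total[lang] for lang in total}

def cross_language_consistency(results):
    """Fraction of items where the model gives the same answer in
    every language, regardless of whether that answer is correct."""
    same = sum(
        1 for answers in results.values()
        if len({pred for pred, _ in answers.values()}) == 1
    )
    return same / len(results)

print(per_language_accuracy(results))   # e.g. {'en': 1.0, ..., 'ja': 0.0}
print(cross_language_consistency(results))
```

Separating accuracy from consistency matters because a model can be wrong in the same way everywhere (consistent but inaccurate) or right only in high-resource languages (accurate in English but inconsistent overall); the two metrics surface different failure modes.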