RESEARCH PAPER
Diagnosing Performance Gaps in Causal Reasoning via Bilingual Prompting in LLMs
A multilingual benchmark analysis across 565,632 prompts and six languages

In Summary
Welo Data’s latest research introduces a multilingual evaluation framework that reveals how large language models (LLMs) perform when reasoning across languages.
By evaluating model responses to prompts containing content in one language and questions in another, the study identifies subtle but consistent drops in accuracy, language-driven reasoning biases, and the influence of model architecture.
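To make the setup concrete, the sketch below shows one way such bilingual prompts might be assembled: a premise in one language paired with a causal question in another, swept across every language pairing. The language codes, template wording, and example sentences are illustrative assumptions, not the paper's actual materials or code.

```python
from itertools import product

# Illustrative sketch only: the language set, prompt template, and example
# sentences are assumptions for demonstration, not the study's materials.
LANGUAGES = ["en", "es", "fr", "de", "ja", "hi"]  # assumed six-language set

def build_bilingual_prompt(premise: str, premise_lang: str,
                           question: str, question_lang: str) -> str:
    """Combine a premise in one language with a causal question in another,
    mirroring the bilingual prompting setup described above."""
    return (
        f"[Context ({premise_lang})]\n{premise}\n\n"
        f"[Question ({question_lang})]\n{question}\n"
    )

# Example: Spanish premise paired with an English causal question.
premise_es = "El suelo estaba mojado cuando salimos de casa."
question_en = "What most plausibly caused the ground to be wet?"
prompt = build_bilingual_prompt(premise_es, "es", question_en, "en")

# Crossing every premise language with every question language yields the
# full grid of bilingual conditions a benchmark like this would sweep.
conditions = list(product(LANGUAGES, repeat=2))
print(f"{len(conditions)} language pairings")  # 36 pairings for six languages
print(prompt)
```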
What you’ll learn:
- How bilingual prompts impact causal reasoning accuracy
- Where popular LLMs struggle with multilingual reasoning tasks
- How task type and language shape LLM reasoning accuracy
- Why English doesn’t always lead to better performance
- What prompt and response biases mean for real-world AI