RESEARCH PAPER

Diagnosing Performance Gaps in Causal Reasoning via Bilingual
Prompting in LLMs

A multilingual benchmark analysis across 565,632 prompts and six languages

In Summary

  • How bilingual prompts affect causal reasoning accuracy
  • Where popular LLMs struggle with multilingual reasoning tasks
  • How task type and prompt language shape reasoning accuracy
  • Why English prompts don’t always yield better performance
  • What prompt and response biases mean for real-world AI