RESEARCH PAPER

  • Why current causal reasoning benchmarks fall short
  • How human-crafted, complex prompts improve evaluation
  • Findings on multilingual model accuracy and consistency