RESEARCH PAPER
Multi-Select Causal Reasoning in LLMs
A New Framework for Evaluating Complex AI Behavior

In Summary
Welo Data’s latest research shows that the choice of scoring metric significantly alters our understanding of LLM performance on causal reasoning tasks.
The study introduces multi-select prompts—requiring models to identify all valid causes—and evaluates eight leading LLMs across five scoring methods: precision, recall, F1 score, complement, and a “trapdoor” metric that penalizes any incorrect selection.
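To make the metric comparison concrete, the sketch below shows one plausible way to score a single multi-select response against the set of valid causes. The function name, the subset-based definition of the “trapdoor” score, and the omission of the complement metric are illustrative assumptions; they are not taken from the paper.

```python
# Hypothetical scoring sketch: metric names follow the summary above, but the
# exact definitions (especially "trapdoor") are assumptions, not the paper's
# published formulas.

def score_multi_select(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Score one multi-select response against the set of valid causes."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Assumed "trapdoor" behavior: any incorrect selection zeroes the score;
    # otherwise credit the fraction of valid causes recovered.
    trapdoor = recall if predicted <= gold else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "trapdoor": trapdoor}

# Example: the model selects options A and D when the valid causes are A and B.
print(score_multi_select({"A", "D"}, {"A", "B"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'trapdoor': 0.0}
```

Under these assumed definitions, the same response earns partial credit on precision, recall, and F1 but zero under the trapdoor metric, which illustrates why metric choice can reorder model rankings.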
KEY FINDINGS
- Metric choice significantly impacts model rankings and interpretations
- Most models under-select valid causes or over-select irrelevant ones
- Chain-of-thought prompting shows no consistent performance benefit
- Models display distinct behavioral patterns across causal and non-causal tasks