RESEARCH PAPER
Multi-Select Causal Reasoning in LLMs
A New Framework for Evaluating Complex AI Behavior

In Summary
Welo Data’s latest research shows that the choice of scoring metric significantly alters our understanding of LLM performance on causal reasoning tasks.
The study introduces multi-select prompts—requiring models to identify all valid causes—and evaluates eight leading LLMs across five scoring methods: precision, recall, F1 score, complement, and a “trapdoor” metric that penalizes any incorrect selection.
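To make the metric comparison concrete, the sketch below shows one plausible way to score a single multi-select response against the set of valid causes. The function name, the subset-based definition of the “trapdoor” score, and the omission of the complement metric are illustrative assumptions; they are not taken from the paper.

```python
# Hypothetical scoring sketch: metric names follow the summary above, but the
# exact definitions (especially "trapdoor") are assumptions, not the paper's
# published formulas.

def score_multi_select(predicted: set[str], gold: set[str]) -> dict[str, float]:
    """Score one multi-select response against the set of valid causes."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Assumed "trapdoor" behavior: any incorrect selection zeroes the score;
    # otherwise credit the fraction of valid causes recovered.
    trapdoor = recall if predicted <= gold else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "trapdoor": trapdoor}

# Example: the model selects options A and D when the valid causes are A and B.
print(score_multi_select({"A", "D"}, {"A", "B"}))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'trapdoor': 0.0}
```

Under these assumed definitions, the same response earns partial credit on precision, recall, and F1 but zero under the trapdoor metric, which illustrates why metric choice can reorder model rankings.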
KEY FINDINGS
- Metric choice significantly impacts model rankings and interpretations
- Most models under-select valid causes or over-select irrelevant ones
- Chain-of-thought prompting shows no consistent performance benefit
- Models display distinct behavioral patterns across causal and non-causal tasks