BRIDGE: Predicting Human Task Completion Time From Model Performance

Predicting Human Time from IRT difficulty

Fit an IRT model, then regress log human completion time on IRT task difficulty
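Rough sketch of that two-step pipeline, to make it concrete for myself. Everything here is an assumption: the data is synthetic, the IRT model is a 1PL (Rasch) model fit by plain gradient ascent, and the names `fit_rasch` and `predict_minutes` are made up for illustration. The notes above don't record which IRT variant or fitting procedure the paper actually uses.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fit_rasch(responses, n_steps=3000, lr=0.02):
    """Fit a 1PL (Rasch) model by gradient ascent on the Bernoulli log-likelihood.

    responses: (n_models, n_tasks) array of 0/1 outcomes.
    Returns (ability, difficulty); difficulty is centered at zero
    because the model is only identified up to a shared shift.
    """
    n_models, n_tasks = responses.shape
    ability = np.zeros(n_models)
    difficulty = np.zeros(n_tasks)
    for _ in range(n_steps):
        p = sigmoid(ability[:, None] - difficulty[None, :])
        resid = responses - p
        ability += lr * resid.sum(axis=1)     # d loglik / d ability
        difficulty -= lr * resid.sum(axis=0)  # d loglik / d difficulty has the opposite sign
        shift = difficulty.mean()
        difficulty -= shift
        ability -= shift
    return ability, difficulty


# Synthetic stand-in data (NOT from the paper): 30 models, 40 tasks.
rng = np.random.default_rng(0)
true_ability = rng.normal(0.0, 1.5, size=30)
true_difficulty = rng.normal(0.0, 1.5, size=40)
prob = sigmoid(true_ability[:, None] - true_difficulty[None, :])
responses = (rng.random(prob.shape) < prob).astype(float)
# Assumed relationship for the demo: log human minutes is linear in difficulty plus noise.
log_minutes = 3.0 + 1.0 * true_difficulty + rng.normal(0.0, 0.5, size=40)

# Step 1: estimate task difficulty from model success/failure only.
ability, difficulty = fit_rasch(responses)

# Step 2: ordinary least squares of log human time on estimated difficulty.
slope, intercept = np.polyfit(difficulty, log_minutes, deg=1)


def predict_minutes(task_difficulty):
    """Map an IRT difficulty estimate to predicted human completion time in minutes."""
    return np.exp(intercept + slope * task_difficulty)


print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The point of the design is that step 1 needs no human timing data, so once the regression is fit on tasks with annotations, `predict_minutes` can be applied to tasks that only have model results.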

Read this because it's related work for our agent psychometrics paper. I only skimmed it, so these notes are sparse.

all benchmarks are agentic

METR Data: SWAA, HCAST, RE-Bench

  • human time annotations
  • multiple trials per model-task pair; a pair counts as successful if the model succeeds in at least 50% of trials
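The 50% rule for collapsing repeated trials into a single binary outcome could look like the following. This is a minimal sketch; `task_success` and its `threshold` parameter are hypothetical names, not taken from the paper's code.

```python
def task_success(trial_outcomes, threshold=0.5):
    """Collapse repeated trials for one model-task pair into a single 0/1 outcome.

    trial_outcomes: sequence of 0/1 (or bool) results, one per trial.
    Returns 1 if the success rate is at least `threshold`, else 0.
    """
    if len(trial_outcomes) == 0:
        raise ValueError("need at least one trial")
    success_rate = sum(trial_outcomes) / len(trial_outcomes)
    return int(success_rate >= threshold)


print(task_success([1, 0, 1, 0]))  # exactly 50% of trials succeed
print(task_success([1, 0, 0]))     # one of three trials succeeds
```

Note that "at least 50%" means a tie (e.g. 2 of 4) counts as a success.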

MLE Bench: 75 Kaggle questions

  • each question is split into 3 binary tasks: whether the agent produced a valid submission, whether the submission achieves above-median performance, and whether it earns any Kaggle medal
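A sketch of that expansion, turning one question into three binary items for the IRT response matrix. The `KaggleResult` fields and the item-naming scheme are my own assumptions for illustration; with this scheme 75 questions would give up to 225 items.

```python
from dataclasses import dataclass


@dataclass
class KaggleResult:
    """Outcome of one agent run on one question (hypothetical field names)."""
    valid_submission: bool
    above_median: bool
    any_medal: bool


def to_binary_tasks(question_id, result):
    """Expand one question's outcome into three 0/1 items keyed by item name."""
    return {
        f"{question_id}/valid_submission": int(result.valid_submission),
        f"{question_id}/above_median": int(result.above_median),
        f"{question_id}/any_medal": int(result.any_medal),
    }


# Example: a valid, above-median submission that did not earn a medal.
items = to_binary_tasks("q01", KaggleResult(True, True, False))
print(items)
```

The three items are nested in difficulty (a medal implies above-median, which implies a valid submission), so they give the IRT fit an easy, medium, and hard item per question.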

GDPVal: economically significant real-world tasks across 44 domains

  • had to use LLM-as-a-Judge to grade

Cybench: CTF tasks

for running agents: InspectAI with a ReAct scaffold (bash + Python interpreter); runs stop at 1000 turns or when the context window is full; temperature 0