BRIDGE: Predicting Human Task Completion Time From Model Performance

Fit IRT model on model success/failure data, then regress log of human completion time against IRT task difficulty (the per-task parameter; ability is per-model)
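To make the two-step idea concrete for myself: a minimal sketch, assuming a 2PL model P(success) = sigmoid(a_i * (theta_j - b_i)) and using the per-task difficulty b_i as the regressor, since human time is a property of the task. Everything here is my own illustration on synthetic data (`fit_2pl`, the gradient-ascent fit, and all constants are assumptions), not the paper's estimation procedure.

```python
import numpy as np

def sigmoid(x):
    # Clip to avoid overflow in exp for extreme logits.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def fit_2pl(responses, n_steps=3000, lr=0.05, l2=0.01):
    """Fit a 2PL IRT model by gradient ascent on an L2-penalized log-likelihood.

    responses: (n_models, n_tasks) array of 0/1 outcomes.
    Returns (theta, a, b): model ability, task discrimination, task difficulty.
    """
    n_models, n_tasks = responses.shape
    theta = np.zeros(n_models)
    log_a = np.zeros(n_tasks)  # a = exp(log_a) keeps discrimination positive
    b = np.zeros(n_tasks)
    for _ in range(n_steps):
        a = np.exp(log_a)
        diff = theta[:, None] - b[None, :]
        resid = responses - sigmoid(a[None, :] * diff)  # d loglik / d logit
        g_theta = (resid * a[None, :]).sum(axis=1) - l2 * theta
        g_b = (-resid * a[None, :]).sum(axis=0) - l2 * b
        g_log_a = (resid * diff).sum(axis=0) * a - l2 * log_a
        theta += lr * g_theta / n_tasks
        b += lr * g_b / n_models
        log_a += lr * g_log_a / n_models
    shift = theta.mean()  # fix the location of the scale: center abilities at 0
    return theta - shift, np.exp(log_a), b - shift

# Synthetic data (made up): 60 models, 40 tasks, log human time linear in difficulty.
rng = np.random.default_rng(0)
theta_true = rng.normal(size=60)
b_true = np.linspace(-2.0, 2.0, 40)
p_true = sigmoid(theta_true[:, None] - b_true[None, :])
responses = (rng.random(p_true.shape) < p_true).astype(float)
log_time = 3.0 + 1.5 * b_true + rng.normal(scale=0.3, size=40)

theta_hat, a_hat, b_hat = fit_2pl(responses)

# Step 2: ordinary least squares of log human time on estimated difficulty.
slope, intercept = np.polyfit(b_hat, log_time, 1)
print(f"log(time) ~ {intercept:.2f} + {slope:.2f} * difficulty")
```

Once the line is fit, a new task's estimated difficulty maps to a predicted human time via exp(intercept + slope * b).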

Read this because it’s related work for our agent psychometrics paper. Only skimmed it, so the notes are sparse.

all benchmarks are agentic

METR Data: SWAA, HCAST, RE-Bench

human time annotations
multiple trials per model-task pair; a pair counts as successful if the model succeeds in at least 50% of trials
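The 50% rule can be sketched as below. This is my own illustration: `binarize_trials` and the (model, task, success) record layout are hypothetical names, not the paper's code or METR's data format.

```python
from collections import defaultdict

def binarize_trials(trials, threshold=0.5):
    """Collapse repeated trials into one 0/1 outcome per (model, task) pair.

    trials: iterable of (model, task, success), with success a bool or 0/1.
    A pair counts as solved when its success rate is >= threshold.
    """
    counts = defaultdict(lambda: [0, 0])  # (model, task) -> [successes, total]
    for model, task, success in trials:
        counts[(model, task)][0] += int(bool(success))
        counts[(model, task)][1] += 1
    return {pair: int(s / n >= threshold) for pair, (s, n) in counts.items()}

# Made-up trial records for illustration.
trials = [
    ("model-a", "task-1", True),
    ("model-a", "task-1", False),   # 1 of 2 succeed -> counts as solved
    ("model-a", "task-2", False),
    ("model-a", "task-2", False),
    ("model-a", "task-2", True),    # 1 of 3 succeed -> counts as not solved
]
outcomes = binarize_trials(trials)
```

The resulting 0/1 values are what would fill the model-by-task response matrix that the IRT model is fit on.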

MLE-bench: 75 Kaggle competitions

for each competition, they made 3 binary tasks: whether the agent produced a valid submission, whether the submission achieves above-median performance, and whether it earns any Kaggle medal
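A minimal sketch of the 1-to-3 expansion, assuming each run has already been graded into three flags. The function name, the task-ID format, and the example competition name are hypothetical; I am also assuming that an invalid submission fails the two stricter checks.

```python
def expand_to_tasks(competition, valid_submission, above_median, any_medal):
    """Turn one graded Kaggle run into three binary IRT items.

    Returns a dict mapping a task ID to a 0/1 outcome.
    """
    valid = bool(valid_submission)
    return {
        f"{competition}/valid_submission": int(valid),
        # Assumption: the stricter checks only count if the submission is valid.
        f"{competition}/above_median": int(valid and bool(above_median)),
        f"{competition}/any_medal": int(valid and bool(any_medal)),
    }

# Made-up run: valid and above the median, but no medal.
items = expand_to_tasks(
    "example-competition",
    valid_submission=True,
    above_median=True,
    any_medal=False,
)
```

With this layout, 75 competitions give 225 items, each with its own IRT difficulty.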

GDPval: economically significant real-world tasks across 44 occupations

Cybench: CTF tasks

for running agents: Inspect AI with a ReAct scaffold (bash + Python interpreter tools), each run capped at 1000 turns or a full context window, temperature 0