BRIDGE: Predicting Human Task Completion Time From Model Performance
Predicting Human Time from IRT difficulty
Fit an IRT model, then regress log human completion time against IRT task difficulty
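A rough sketch of the fit-then-regress idea, for my own reference. Assumptions that are mine and not from the paper: a 1PL (Rasch) model, a small L2 penalty to keep estimates finite, plain gradient ascent, and made-up toy data. The paper may well use a different IRT variant and fitting procedure. All function names here are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_rasch(responses, steps=2000, lr=0.05, l2=0.1):
    """Fit P(success) = sigmoid(theta_i - b_j) by penalized gradient ascent.

    responses[i][j] is 1 if model i solved task j, else 0.
    The L2 penalty is an assumption of this sketch; it keeps estimates
    finite when a model solves everything or a task is rarely solved.
    """
    n_models, n_tasks = len(responses), len(responses[0])
    theta = [0.0] * n_models  # model abilities
    b = [0.0] * n_tasks       # task difficulties
    for _ in range(steps):
        g_theta = [0.0] * n_models
        g_b = [0.0] * n_tasks
        for i in range(n_models):
            for j in range(n_tasks):
                resid = responses[i][j] - sigmoid(theta[i] - b[j])
                g_theta[i] += resid  # d loglik / d theta_i
                g_b[j] -= resid      # d loglik / d b_j
        theta = [t + lr * (g - l2 * t) for t, g in zip(theta, g_theta)]
        b = [d + lr * (g - l2 * d) for d, g in zip(b, g_b)]
    return theta, b

def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy data (invented): 4 models x 5 tasks, tasks ordered easy to hard.
responses = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
]
human_minutes = [1, 5, 30, 120, 480]  # invented human time annotations

theta, b = fit_rasch(responses)
slope, intercept = fit_line(b, [math.log(m) for m in human_minutes])

def predict_minutes(difficulty):
    """Map an IRT difficulty to a predicted human completion time."""
    return math.exp(slope * difficulty + intercept)
```

Once the line is fit, any task with an IRT difficulty but no human annotation can be passed through `predict_minutes` to get a time estimate.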
Read this because it's related work for our agent psychometrics paper. I only skimmed it, so these notes are sparse.
all benchmarks are agentic
METR Data: SWAA, HCAST, RE-Bench
- human time annotations
- multiple trials per model-task pair; a pair counts as a success if the model succeeds in at least 50% of trials
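The 50% rule above collapses repeated trials into a single binary response per model-task pair, which is what an IRT fit needs. A minimal sketch, assuming trial outcomes are stored as lists of 0/1 keyed by (model, task); the data layout and names are hypothetical:

```python
def binarize_trials(trial_outcomes, threshold=0.5):
    """Collapse per-trial 0/1 outcomes into one binary response per pair.

    trial_outcomes maps (model, task) -> list of 0/1 trial results.
    A pair is a success if its success rate is at least `threshold`.
    """
    return {
        pair: int(sum(trials) / len(trials) >= threshold)
        for pair, trials in trial_outcomes.items()
    }

# Invented example data.
trials = {
    ("model_a", "task_1"): [1, 0],     # 50% -> success
    ("model_a", "task_2"): [1, 0, 0],  # 33% -> failure
    ("model_b", "task_1"): [1, 1, 1],  # 100% -> success
}
binary = binarize_trials(trials)
```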
MLE Bench: 75 kaggle questions
- for each question, made it into 3 tasks: whether the agent produced a valid submission, whether the submission achieves above-median performance, and whether it earns any Kaggle medal
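The three-way split above turns each Kaggle question into three binary items of increasing difficulty (so 75 questions give 225 items). A small sketch of that expansion; the record type, field names, and item-naming scheme are hypothetical, not taken from MLE-bench or the paper:

```python
from dataclasses import dataclass

@dataclass
class KaggleResult:
    """Hypothetical summary of one agent run on one Kaggle question."""
    competition: str
    valid_submission: bool
    above_median: bool
    any_medal: bool

def expand_to_tasks(result):
    """Turn one question's result into three binary task outcomes."""
    return {
        f"{result.competition}/valid_submission": int(result.valid_submission),
        f"{result.competition}/above_median": int(result.above_median),
        f"{result.competition}/any_medal": int(result.any_medal),
    }

# Invented example: a valid, above-median submission that earned no medal.
example = KaggleResult("some-competition", True, True, False)
items = expand_to_tasks(example)
```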
GDPVal: economically significant real-world tasks across 44 domains
- had to use LLM-as-a-Judge to grade
Cybench: CTF tasks
for running agents: Inspect AI with a ReAct scaffold (bash + python interpreter), run for up to 1000 turns or until the context window is full, temperature 0