Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents
Goal: an automated way to create a benchmark with tunable difficulty for long-horizon, system-level coding tasks
Method: took existing codebases and applied one of:
- removed a function (local-level change)
- corrupted a function, without telling the LLM which one
- corrupted multiple functions, without telling the LLM which ones
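The notes don't say how corruptions are generated. One minimal way to plant a silent semantic bug (everything below is illustrative, not the paper's method) is to flip a comparison operator in the AST:

```python
import ast

class FlipComparison(ast.NodeTransformer):
    """Flip the first `<` comparison found to `>=` (one toy corruption)."""
    def __init__(self):
        self.done = False

    def visit_Compare(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.ops[0], ast.Lt):
            node.ops[0] = ast.GtE()  # negate the condition
            self.done = True
        return node

def corrupt_function(source: str) -> str:
    """Return source with one comparison flipped, or unchanged if none found."""
    tree = FlipComparison().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))

original = "def is_small(x):\n    return x < 10\n"
print(corrupt_function(original))  # the function now silently misbehaves at the boundary
```

The corrupted code still parses and runs, so the agent can only find the bug through the failing test cases.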
Function difficulty is measured by the function's number of lines of code and its cyclomatic complexity. Corruption difficulty is measured by the function's harmonic distance from the rest of the codebase through the function call graph.
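A sketch of the call-graph metric, assuming "harmonic distance" means the harmonic mean of shortest-path distances from the function to every other function (unreachable functions count as infinitely far, contributing 0 to the sum of inverse distances); the toy call graph below is made up:

```python
from collections import deque

def shortest_dists(graph, start):
    """BFS hop counts from `start` over an adjacency dict."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def harmonic_distance(graph, f):
    """Harmonic mean of call-graph distances from f to every other function
    (one plausible reading of the metric; not the paper's exact definition)."""
    others = [g for g in graph if g != f]
    dist = shortest_dists(graph, f)
    inv_sum = sum(1.0 / dist[g] for g in others if g in dist)
    return len(others) / inv_sum if inv_sum else float("inf")

call_graph = {
    "main": ["parse", "run"],
    "parse": ["main", "tokenize"],
    "run": ["main"],
    "tokenize": ["parse"],
    "helper": [],  # never called: infinitely far from everything
}
print(harmonic_distance(call_graph, "main"))  # → 1.6
```

Under this reading, a well-connected function like `main` gets a small value, while an isolated helper gets infinity, which matches the intuition that low harmonic distance means the function is entangled with the rest of the system.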
Results:
- For function removal only: functions with more lines of code and lower harmonic distance are harder to reconstruct
- For corruptions, the bottleneck seems to be locating which function to fix
- Scaling test-time compute (# of submissions against the test cases) improves both local- and system-level fixing, but the gains mostly plateau after 2 submissions
- A Mann-Whitney U test shows that reasoning mostly helps local-level, not system-level, fixing, while scaling test-time compute helps both
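For reference, a minimal sketch of the test itself, using a normal approximation without tie-continuity corrections and hypothetical per-task success rates (the real analysis would use Breakpoint's actual scores, and `scipy.stats.mannwhitneyu` is the production choice):

```python
from statistics import NormalDist

def mann_whitney_u(a, b):
    """Two-sided Mann-Whitney U test via the normal approximation.
    Returns (U1, p); fine for a quick check, not a replacement for scipy."""
    combined = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):            # assign average ranks to tied values
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1           # ranks are 1-based
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)           # smaller of the two U statistics
    mu = n1 * n2 / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    p = 2 * NormalDist().cdf((u - mu) / sigma)  # u <= mu, so cdf is the lower tail
    return u1, min(p, 1.0)

# hypothetical per-task success rates: local-level fixing with vs. without reasoning
with_reasoning = [0.9, 0.8, 0.85, 0.95, 0.7]
without_reasoning = [0.4, 0.5, 0.3, 0.45, 0.6]
u, p = mann_whitney_u(with_reasoning, without_reasoning)
print(u, p)  # small p: reasoning shifts the local-level score distribution
```

Mann-Whitney is a sensible choice here because per-task scores are unlikely to be normally distributed, and the test only assumes the two samples are independent and ordinal.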