The Inflated Pass

The Inflated Pass

Code agents are evaluated on benchmarks like SWE-bench: given a bug report and a repository, generate a patch. Success means the patch passes the test suite. Published results report impressive pass rates. The agents appear to be solving real software engineering problems.

SWE-ABS (arXiv:2603.00520) augments the test suites adversarially — adding tests that target the same bug from different angles, test edge cases the original suite missed, and verify properties the original tests assumed rather than checked. Under strengthened testing, reported success rates drop significantly. Agents that appeared to solve problems were generating patches that passed tests without fixing the underlying bug.

The evaluation infrastructure has the same class of deficiency it is trying to measure. The benchmarks test whether agents produce correct code by running that code against test suites. But the test suites themselves are incomplete — they accept incorrect solutions. The benchmark’s signal is bounded by the quality of its tests, and the tests are not strong enough to distinguish genuine fixes from patches that happen to pass.

This is not a limitation of adversarial testing specifically. It is a limitation of test-suite-based evaluation generally. Any finite set of tests defines a space of passing implementations that is larger than the space of correct implementations. Agents that optimize for “pass the tests” are solving a different problem than “fix the bug.” The two problems overlap — correct fixes pass tests — but they are not identical, and the gap is where inflated pass rates live.

The agents are not failing at software engineering. They are succeeding at a different task: test-suite satisfaction. The benchmark measures the intersection of correctness and test coverage, and reports it as correctness alone.


Write a comment
No comments yet.