Agent coding benchmark tests such as SWE-bench and Terminal-Bench are widely used to compare the software engineering capabilities of state-of-the-art AI models. The top positions on these benchmark ...