As evals for models become saturated, “evals” for agents will become the next frontier of what matters.
They'll be harder to procure too, since they're so specific and task-based.
Is Devin better than Amp at refactoring? Is Codex better than Claude Code for debugging?