We’re thinking of making our internal benchmarking tool for coding agents (not models) like Claude Code, Cursor, Codex, Amp, Devin, etc. more publicly accessible. DM me if you’re interested in seeing it and giving feedback on what you’d like to see!