I'm very confident that we've accumulated the largest set of IRL task-based evals for coding agents (codex, claude code, cursor, amp, devin, etc.) over the past few weeks with @askModuAI. Need to figure out a way to make these benchmarks publicly accessible.