So after playing with this for most of the day: neither could do it (expected), but GPT5 via Codex gave up a lot and would just crash (example below). That said, what did end up working was having GPT5 create the detailed spec based on the arXiv paper and then review the Opus code.
xjdr, 10 Aug at 00:50
"How do you benchmark new models?"
You have to know what you're doing to direct traffic, and I had to create the test harnesses and pass criteria myself, but their powers combined made something that rivals my existing version. Pretty impressive initial test, if I'm being honest ...