GPT-OSS underperforms even on benchmarks that require raw tool calling. For example, CORE-Bench requires agents to run bash commands to reproduce scientific papers. DeepSeek V3 scores 18%. GPT-OSS scores 11%.
Nathan Lambert, 12 August, 23:44
gpt-oss is a tool processing / reasoning engine only. Kind of a hard open model to use. Traction imo will be limited. Best way to get traction is to release models that are flexible, easy to use w/o tools, and reliable. Then, bespoke interesting models like tool use later