ok, need help! tried finetuning GPT-OSS over the weekend. it works for ~100 steps, then throws a CUDA out-of-memory error. my guess is that every so often all the tokens get routed to a single expert, and then training crashes. is there an easy fix? never finetuned an MoE before — rough sketch of what I mean below
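for context on "all tokens to one expert": here's a minimal sketch of the two standard guards, in plain PyTorch with top-1 routing (`route_top1`, the `capacity_factor` default, and the 0.01 loss weight below are all illustrative, not GPT-OSS's actual code). a per-expert capacity cap bounds activation memory even if the router collapses, and a Switch-Transformer-style load-balancing aux loss pushes routing back toward uniform.

```python
import torch
import torch.nn.functional as F

def route_top1(logits: torch.Tensor, num_experts: int, capacity_factor: float = 1.25):
    """logits: (tokens, num_experts) router scores for one batch."""
    probs = F.softmax(logits, dim=-1)          # (T, E)
    expert = probs.argmax(dim=-1)              # chosen expert per token, (T,)

    # load-balancing aux loss (Switch-Transformer style):
    # fraction of tokens dispatched to each expert ...
    frac_tokens = F.one_hot(expert, num_experts).float().mean(dim=0)
    # ... times mean router probability per expert, summed
    frac_probs = probs.mean(dim=0)
    aux_loss = num_experts * torch.sum(frac_tokens * frac_probs)

    # capacity cap: each expert gets at most `capacity` slots, so a
    # collapsed router can't blow up a single expert's memory
    capacity = int(capacity_factor * logits.shape[0] / num_experts)
    keep = torch.zeros_like(expert, dtype=torch.bool)
    for e in range(num_experts):
        idx = (expert == e).nonzero(as_tuple=True)[0]
        keep[idx[:capacity]] = True            # overflow tokens get dropped

    return expert, keep, aux_loss

# quick check
logits = torch.randn(4096, 8)
expert, keep, aux = route_top1(logits, num_experts=8)
print(f"kept {keep.float().mean():.0%} of tokens, aux_loss={aux:.3f}")
```

if that's the right diagnosis, the fix would be adding something like `0.01 * aux_loss` to the task loss each step — the capacity cap alone should stop the OOM, and the aux loss keeps experts from starving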