When it comes to AI distributed training, I find that people in the web2 AI circle often label it a "false proposition": computing power can be aggregated, sure, but effective distributed collaboration incurs terrifying bandwidth costs. Recently @0G_labs published the DiLoCoX paper, which seems aimed squarely at this problem. Let's go through it in detail:

1) First, why is distributed training considered a "false proposition"? The core contradiction is simple: you want to replace an expensive centralized A100 cluster by aggregating 100 cheap GPUs, which looks like it saves 90% on hardware costs. But those 100 GPUs have to stay synchronized, exchanging terabytes of gradient data in every round of training. Traditional solutions need something like 100Gbps of dedicated bandwidth, and data-center-grade networking at that level can run to hundreds of thousands of dollars a month. Do the math and the money saved on GPUs is all spent on bandwidth; you may even come out behind. By this logic, you cut machine costs but pick up bandwidth costs, so the problem isn't really solved, and that is why it keeps getting dismissed as a false proposition.

2) 0G's DiLoCoX paper attracted attention because they claim to have trained a 107B-parameter model on a 1Gbps network (typical office bandwidth), 357 times faster than a conventional AllReduce baseline. The number is genuinely explosive: 1Gbps vs 100Gbps is a 100-fold gap in bandwidth, yet training speed went up 357 times. How did they pull that off?

After some digging, the solution combines four optimizations:

- Pipeline Parallelism: the model is sliced and processed in segments;
- Dual Optimizer Policy: a two-level optimizer strategy that reduces how often workers need to synchronize;
- One-Step-Delay Overlap: communication and computation run in parallel instead of waiting on each other;
- Adaptive Gradient Compression: gradients are compressed intelligently before being sent.

In plain terms, they turned the original requirement of "real-time strong synchronization" into "asynchronous weak synchronization," and "full data transmission" into "compressed incremental transmission." As a metaphor: the traditional approach is a live 100-person video conference where every movement of every participant has to be streamed in real time, while DiLoCoX has everyone record locally and send only key frames and diffs. Communication volume drops by 100 times, yet information completeness stays above 99%.

Why is this feasible? In my view, the core is that they exploit one property of AI training: fault tolerance. Training a model is not like transferring money, where being off by a cent is unacceptable. A slightly noisy gradient update or a bit of synchronization delay has a negligible impact on final convergence. DiLoCoX spends this "fault tolerance space" to buy an order-of-magnitude efficiency gain at an acceptable cost in precision. That is typical engineering thinking: not pursuing perfection, but the best cost-performance ratio. (A toy sketch of this "train locally, sync compressed deltas occasionally" pattern follows below.)
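To make that concrete, here is a minimal PyTorch sketch of the "local training plus periodic compressed sync" pattern, in the spirit of the dual-optimizer and gradient-compression ideas above. This is my own toy illustration, not code from the DiLoCoX paper: the worker count, the number of local steps H, the top-k ratio, and every function name are assumptions, and the real system additionally pipelines model slices and overlaps communication with the next round of compute, which this loop leaves out.

```python
# Toy illustration only (not 0G's code): each worker trains locally with its own
# inner optimizer, and only every H steps ships a sparsified "pseudo-gradient"
# (its weight delta), which is averaged and applied by an outer optimizer.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

H = 50              # local steps between synchronizations (assumed value)
TOPK_RATIO = 0.01   # keep only the largest 1% of delta entries (assumed value)
NUM_WORKERS = 4     # simulated workers standing in for remote GPUs

def topk_compress(delta: torch.Tensor, ratio: float) -> torch.Tensor:
    """Stand-in for adaptive gradient compression: keep only the largest-magnitude entries."""
    flat = delta.flatten()
    k = max(1, int(flat.numel() * ratio))
    idx = flat.abs().topk(k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(delta)

# A tiny model and random batches stand in for a real LLM and its sharded corpus.
global_model = nn.Linear(32, 1)
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)
workers = []
for _ in range(NUM_WORKERS):
    m = copy.deepcopy(global_model)
    workers.append((m, torch.optim.AdamW(m.parameters(), lr=1e-3)))

for outer_round in range(10):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for model, inner_opt in workers:
        # Inner phase: H purely local steps, no network traffic at all.
        for _ in range(H):
            x, y = torch.randn(8, 32), torch.randn(8, 1)
            loss = F.mse_loss(model(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
        # Pseudo-gradient: how far this worker drifted from the shared weights,
        # compressed before it would have to cross the slow 1Gbps link.
        for d, p_local, p_global in zip(deltas, model.parameters(), global_model.parameters()):
            d += topk_compress(p_global.data - p_local.data, TOPK_RATIO) / NUM_WORKERS
    # Outer phase: treat the averaged delta as a gradient for the outer optimizer.
    # (DiLoCoX additionally overlaps this exchange with the next round's compute.)
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d
    outer_opt.step()
    # Workers re-sync to the updated shared weights before the next round.
    for model, _ in workers:
        model.load_state_dict(global_model.state_dict())
```

The thing to notice is the communication pattern: each worker sends a heavily sparsified delta once every H steps instead of full gradients every step, which is where the order-of-magnitude reduction in traffic comes from, and convergence tolerates it precisely because of the fault tolerance described above.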
3) However, solving the bandwidth problem alone is not enough; 0G's ambitions are clearly bigger. Their overall architecture makes that plain: they also have a Storage layer priced at $10/TB, openly claiming to crush Filecoin, and a DA layer designed specifically for AI with GB-level throughput.

The reason they can make storage 100 times cheaper is that they optimize specifically for AI training scenarios. For example, the TB-scale data generated during training, such as checkpoints and logs, typically only lives for a few days, so strict "permanent storage" is unnecessary. Hence a pragmatic tiered-storage scheme that provides only the level of service actually needed: hot data reads and writes fast but costs a bit more, cold data is cheaper but slower, and temporary data is deleted after use, which makes it cheapest of all (a toy sketch of this kind of tier-selection policy is at the end of this post). This differentiated pricing speaks directly to the cost structure of AI training.

From the above it is clear that 0G Labs has deliberately adapted compute, storage, and data flow to the realities of the AI training pipeline. They have even tuned the consensus mechanism for AI: a modified version of CometBFT delivering 2500+ TPS with sub-second finality, tuned for the asynchronous characteristics of AI workloads, and so on. In other words, 0G is not "patching" an existing blockchain to support AI; it is designing a set of "AI-native" infrastructure from scratch. Whether this can ultimately pass application-level commercial validation under competitive pressure from traditional AI remains to be seen, but this way of breaking through via differentiation is well worth learning from.
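As a footnote to the tiered-storage idea above, here is a toy sketch of what a lifecycle-aware tier-selection policy could look like. The Artifact record, the pick_tier function, and all tier names, thresholds, and prices are made up for illustration; none of this is 0G's actual pricing or API.

```python
# Illustrative only: route training artifacts to the cheapest storage tier that
# still matches how they are used. Names, thresholds, and prices are assumptions.
from dataclasses import dataclass

@dataclass
class Artifact:
    size_gb: float
    expected_lifetime_days: float  # e.g. a checkpoint that only matters for a few days
    reads_per_day: float           # how "hot" the data is

def pick_tier(a: Artifact) -> str:
    if a.expected_lifetime_days <= 7:
        return "temporary"  # checkpoints/logs: deleted after use, cheapest
    if a.reads_per_day >= 10:
        return "hot"        # frequently read data: fast but pricier
    return "cold"           # long-lived archives: slow and cheap

PRICE_PER_GB_MONTH = {"temporary": 0.002, "hot": 0.02, "cold": 0.01}  # hypothetical

checkpoint = Artifact(size_gb=800, expected_lifetime_days=3, reads_per_day=1)
tier = pick_tier(checkpoint)
print(tier, round(checkpoint.size_gb * PRICE_PER_GB_MONTH[tier], 2))  # -> temporary 1.6
```

The point is simply that when most training byproducts are short-lived, pricing them like permanent archives is wasted money, which is the asymmetry the tiered design exploits.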