The secret behind Parallax’s performance lies in key server-grade optimizations: – Continuous batching: dynamically groups requests to maximize hardware utilization and throughput. – Paged KV-Cache: block-based design prevents memory fragmentation, handles thousands of concurrent requests – Heterogeneous workers: optimal performance across GPUs and Macs. Fast. Reliable. Fully distributed.
Gradient
Gradient20.6.2025
Compared to Petals (BitTorrent-style serving), Parallax running Qwen2.5-72B on 2× RTX 5090s achieved: – 3.1× lower end-to-end latency, 5.3× faster inter-token latency – 2.9× faster time-to-first-token, 3.1× higher I/O throughput Results were consistent and showed great scalability across different input configurations, and this is just the beginning.
66,34K