Introducing Parallax, the first fully distributed inference and serving engine for large language models. Try it now: 🧵
AI is hitting a bottleneck. LLMs are reshaping how we think, build, and create, but their demand for tokens is outpacing what centralized infra can deliver. Chips are saturated, power grids are strained, and intelligence remains locked behind high-cost silos. We need a new paradigm.
Parallax reimagines model inference as a global, collaborative process: models are no longer chained to centralized infrastructure, but are recomposed, executed, and verified across a worldwide mesh of compute.
The engine introduces 3 foundational shifts:
– Intelligence sovereignty: serve models from the hardware you trust
– Composable inference: GPUs, Apple Silicon, and desktops working in harmony
– Latent compute: activate the world's untapped compute
The Parallax Runtime Layer is the core orchestration engine for high-throughput, server-side LLM serving across distributed, heterogeneous networks. It delivers server-grade optimizations—from continuous batching to paged KV-cache—and is the first MLX-based framework to enable professional-grade inference on Apple Silicon. By unifying NVIDIA GPUs and Apple devices into a single compute fabric, Parallax brings frictionless decentralized AI to everyone.
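For anyone curious what "paged KV-cache" actually buys you, here is a minimal sketch of the idea. All names (KVBlockPool and friends) are hypothetical, not Parallax's API: the point is just that the cache is carved into fixed-size blocks so many concurrent requests can share one pre-allocated pool without fragmentation.

```python
# Minimal, illustrative sketch of a paged KV-cache allocator.
# KVBlockPool and its methods are hypothetical names, not Parallax's API;
# a real engine stores key/value tensors in the blocks, not just ids.

BLOCK_SIZE = 16  # tokens per cache block (a typical page size)

class KVBlockPool:
    """Hands out fixed-size KV-cache blocks so sequences of different
    lengths can share one pre-allocated memory pool without fragmentation."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: int, position: int) -> int:
        """Return the block holding the KV entry for `position`, allocating
        a fresh block whenever a sequence crosses a page boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first slot of a new page
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            table.append(self.free_blocks.pop())
        return table[position // BLOCK_SIZE]

    def free(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

pool = KVBlockPool(num_blocks=4)
for pos in range(40):        # a 40-token sequence needs ceil(40/16) = 3 blocks
    pool.append_token(seq_id=0, position=pos)
print(pool.block_tables[0])  # the 3 block ids this sequence occupies
pool.free(seq_id=0)          # blocks return to the pool for other requests
```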
Parallax runs on a distributed architecture called the Swarm: a dynamic network of nodes that collaboratively serve LLMs. Each prompt flows through a pipeline of heterogeneous nodes, each handling a segment of the model. The result: real-time inference that is decentralized, fluid, and verifiable.
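As a rough illustration of how a swarm might split one model across mixed hardware, here is a hypothetical layer-partitioning sketch. Node and plan_pipeline are made-up names, not the Parallax API, and real nodes negotiate shards and ship activations over the network; this only shows the capacity-proportional split.

```python
# Illustrative sketch of swarm-style pipeline serving: each node owns a
# contiguous slice of the model's layers, sized to its memory, and a
# prompt's activations hop node to node. Names are hypothetical.

from dataclasses import dataclass

@dataclass
class Node:
    name: str
    memory_gb: float  # usable accelerator memory on this peer

def plan_pipeline(num_layers: int, nodes: list[Node]) -> dict[str, range]:
    """Assign each node a layer range proportional to its memory, so a
    5090 and a MacBook can serve one model together."""
    total = sum(n.memory_gb for n in nodes)
    plan, start = {}, 0
    for i, node in enumerate(nodes):
        share = round(num_layers * node.memory_gb / total)
        # last node absorbs any rounding remainder
        end = num_layers if i == len(nodes) - 1 else min(start + share, num_layers)
        plan[node.name] = range(start, end)
        start = end
    return plan

# An 80-layer decoder (72B-class) over a heterogeneous trio:
swarm = [Node("rtx5090-a", 32), Node("rtx5090-b", 32), Node("m3-max", 36)]
for name, layers in plan_pipeline(80, swarm).items():
    print(f"{name}: layers {layers.start}-{layers.stop - 1}")
# rtx5090-a: layers 0-25, rtx5090-b: layers 26-51, m3-max: layers 52-79
```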
Compared to Petals (BitTorrent-style serving), Parallax running Qwen2.5-72B on 2× RTX 5090s achieved:
– 3.1× lower end-to-end latency, 5.3× faster inter-token latency
– 2.9× faster time-to-first-token, 3.1× higher I/O throughput
Results were consistent and scaled well across different input configurations, and this is just the beginning.
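For clarity on what those metrics mean, here is a minimal, hypothetical measurement harness; stream_tokens stands in for any engine's streaming generate call, and none of this is Parallax's actual benchmark code.

```python
# How TTFT, end-to-end latency, and inter-token latency are typically
# defined, measured over a token stream. `stream_tokens` is a stand-in
# for any engine's streaming generate call, not a Parallax API.
import time

def measure(stream_tokens, prompt: str) -> dict[str, float]:
    start = time.perf_counter()
    stamps = []
    for _ in stream_tokens(prompt):  # consume the token stream
        stamps.append(time.perf_counter())
    ttft = stamps[0] - start         # time-to-first-token
    e2e = stamps[-1] - start         # end-to-end latency
    # inter-token latency: mean gap between consecutive decoded tokens
    itl = (stamps[-1] - stamps[0]) / max(len(stamps) - 1, 1)
    return {"ttft_s": ttft, "e2e_s": e2e, "itl_s": itl}

# Example with a fake generator emitting one token every 0.05 s:
def fake_stream(prompt):
    for tok in prompt.split():
        time.sleep(0.05)
        yield tok

print(measure(fake_stream, "decentralized inference across a swarm"))
```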
Now live: a chatbot fully powered by Parallax. Every response is generated peer-to-peer with no centralized server involved. Experience decentralized LLM inference:
The swarm is growing. Apply to join the Edge Host Pilot Program to scale the world’s intelligence: