
Parallax by Gradient

Host LLMs across devices sharing GPU to make your AI go brrr

Open Source · Artificial Intelligence · GitHub
2025-10-29

Product Introduction

  1. Parallax by Gradient is a distributed model serving framework that lets users build decentralized AI clusters across heterogeneous devices, including consumer-grade GPUs, Apple Silicon Macs, and cloud servers. It deploys large language models (LLMs) through pipeline parallelism and dynamic resource allocation (see the sketch after this list). Instead of relying on a central coordinator, nodes communicate peer to peer over Lattica.
  2. The core value lies in democratizing access to high-performance LLM inference by aggregating underutilized computational resources. It eliminates hardware uniformity requirements, enabling collaborative model execution across geographically dispersed devices with varying specifications. This approach optimizes resource utilization while maintaining low-latency performance through intelligent request routing and continuous batching.
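
Conceptually, pipeline parallelism here means each peer holds a contiguous slice of the model's layers and streams hidden states to the next peer. The following is a minimal, hypothetical sketch of that data flow in plain PyTorch; the class and variable names are illustrative rather than Parallax's actual API, and the real system would move activations over the peer-to-peer network rather than in-process.

```python
# Minimal, hypothetical sketch of pipeline-parallel inference across peers.
# PeerShard and the layer split below are illustrative, not the Parallax API.
import torch
import torch.nn as nn

class PeerShard:
    """One peer holding a contiguous slice of the model's layers."""
    def __init__(self, layers: nn.ModuleList, device: str):
        self.layers = layers.to(device)
        self.device = device

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # In Parallax this hop would cross the network between peers;
        # here it is a local call for illustration.
        hidden = hidden.to(self.device)
        for layer in self.layers:
            hidden = layer(hidden)
        return hidden

# A toy 8-layer "model" split across three heterogeneous peers.
hidden_size = 64
all_layers = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(8))

shards = [
    PeerShard(all_layers[0:2], "cpu"),  # e.g. a laptop serving the first layers
    PeerShard(all_layers[2:6], "cpu"),  # e.g. an RTX 4090 ("cuda" in practice)
    PeerShard(all_layers[6:8], "cpu"),  # e.g. an older GPU for the final layers
]

hidden = torch.randn(1, hidden_size)    # stand-in for embedded input tokens
for shard in shards:
    hidden = shard.forward(hidden)      # activations flow from peer to peer
print(hidden.shape)
```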

Main Features

  1. Cross-platform pipeline parallelism supports model sharding across NVIDIA GPUs (Ampere/Hopper/Blackwell architectures), Apple Silicon chips, and x86 CPUs. The framework automatically partitions models based on device capabilities using layer-wise allocation strategies. This enables execution of models like Qwen3-0.6B across mixed hardware environments.
  2. Dynamic KV cache management with hybrid backends combines SGLang for GPU acceleration and MLX LM for Apple Silicon optimization. The system implements context-aware caching that adapts to batch sizes from 1 to 8, maintaining throughput above 100 tokens/sec on consumer hardware. Continuous batching automatically reprioritizes requests based on node availability.
  3. Decentralized orchestration uses libp2p-based networking with NAT traversal capabilities, supporting both LAN and public internet configurations. The scheduler node employs consensus algorithms for fault tolerance, maintaining cluster stability even with 30% node churn. Nodes self-report capabilities, including VRAM capacity, compute benchmarks, and network latency, during join operations (see the sketch after this list).
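
To make item 3 concrete, the snippet below sketches the kind of capability report a node might send to the scheduler when joining a cluster. The message shape and field names are assumptions for illustration only, not Parallax's actual wire format.

```python
# Hypothetical shape of a node's self-reported capabilities at join time;
# field names are illustrative, not Parallax's actual protocol.
from dataclasses import dataclass, asdict
import json

@dataclass
class NodeCapabilities:
    node_id: str
    vram_gb: float        # free accelerator memory
    compute_score: float  # result of a local micro-benchmark
    latency_ms: float     # measured round-trip time to existing peers

def build_join_message(caps: NodeCapabilities) -> str:
    """Serialize the capability report a scheduler would receive on join."""
    return json.dumps({"type": "join", "capabilities": asdict(caps)})

print(build_join_message(
    NodeCapabilities(node_id="macbook-m3", vram_gb=24.0,
                     compute_score=3.1, latency_ms=42.0)
))
```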

Problems Solved

  1. Addresses the challenge of running modern LLMs (70B+ parameters) on resource-constrained devices by distributing the computational load. Serving such models conventionally requires expensive A100/H100 clusters, while Parallax enables cost-effective inference on consumer RTX 30/40 series GPUs and M-series Macs.
  2. Targets AI developers and organizations needing to deploy models across mixed infrastructure, including edge devices, on-prem servers, and cloud instances. Research teams conducting distributed systems experiments for LLM optimization will particularly benefit.
  3. Enables use cases like personal AI clusters combining gaming PCs and laptops, multi-region model serving with latency-based routing, and hybrid cloud-edge deployments for data residency compliance. Supports real-time applications through sub-200ms response times in optimized configurations.

Unique Advantages

  1. Unlike centralized inference services, Parallax implements a true peer-to-peer architecture with no control-plane dependency. The framework maintains <500ms inter-node latency even with 16+ participants through optimized libp2p pubsub protocols. Competitors like Ray Serve require centralized head nodes.
  2. Introduces hardware-adaptive quantization that automatically selects the optimal precision (FP8/FP16/INT4) for each device's capability (see the sketch after this list). Blackwell GPU support enables mixed-precision tensor operations across RTX 50 series and B100/B200 architectures. MLX LM integration provides 40% faster performance on Macs versus llama.cpp.
  3. Stands out through native support for 15+ model architectures, including DeepSeek-V3.1, GLM-4.6, and Meta Llama 3 variants. The model zoo includes pre-configured partitioning templates for 70B-parameter models across 8 nodes. Automatic failover reallocates layers within 2 seconds of a node disconnection.
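
One way to picture the hardware-adaptive quantization in item 2 is a lookup from device capability and memory to a weight format. The rule below is a hypothetical sketch under assumed thresholds, not the framework's actual selection policy.

```python
# Hypothetical per-device precision selection (not Parallax's actual policy).

def pick_precision(compute_capability: float, vram_gb: float) -> str:
    """Choose a weight format a device can realistically serve."""
    if compute_capability >= 9.0:   # Hopper/Blackwell class: native FP8 support
        return "fp8"
    if vram_gb >= 24.0:             # enough memory to hold half-precision weights
        return "fp16"
    return "int4"                   # smaller consumer GPUs fall back to 4-bit

# Illustrative devices: (compute capability, VRAM in GB).
devices = {
    "B200":     (10.0, 192.0),
    "RTX 4090": (8.9, 24.0),
    "RTX 3060": (8.6, 12.0),
}
for name, (cc, vram) in devices.items():
    print(f"{name}: {pick_precision(cc, vram)}")
```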

Frequently Asked Questions (FAQ)

  1. What models does Parallax currently support? The framework natively supports Qwen3 variants (0.6B-70B), DeepSeek-V3.1, GLM-4.6, MiniMax-M2, and Meta Llama 3 series. Custom model integration is possible through HuggingFace Transformers compatibility with PyTorch 2.3+ and safetensors format.
  2. Can Windows devices participate in AI clusters? Yes, the Windows client supports NVIDIA GPUs with driver version 535+, using WSL2 for Linux compatibility. Administrator privileges are required for CUDA toolkit dependencies. Apple Silicon Macs join clusters through native macOS builds with MLX 0.8+.
  3. How does Parallax handle different GPU architectures in one cluster? The scheduler uses compute capability scores (Ampere=8.0, Hopper=9.0, Blackwell=9.5) to allocate layers: RTX 4090 nodes receive the more compute-intensive middle layers, while RTX 3060 nodes handle the input/output layers, and memory-bound layers deploy to nodes with >24GB VRAM. A minimal sketch of this kind of placement heuristic follows below.
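
The answer above can be read as a placement heuristic: give each node a contiguous slice of layers sized by its score, with the strongest nodes serving the compute-heavy middle of the pipeline. The sketch below illustrates that idea; the scoring, ordering, and splitting logic are assumptions for illustration, not the actual scheduler.

```python
# Hypothetical layer-placement heuristic; not Parallax's actual scheduler logic.

def assign_layers(num_layers: int, nodes: list[dict]) -> dict[str, range]:
    """Give each node a contiguous layer slice proportional to its score,
    ordered so the strongest nodes sit in the middle of the pipeline."""
    ranked = sorted(nodes, key=lambda n: n["score"])
    # Weaker nodes at the ends (input/output layers), strongest in the middle.
    order = ranked[::2] + ranked[1::2][::-1]

    total = sum(n["score"] for n in order)
    assignment, start = {}, 0
    for i, node in enumerate(order):
        share = round(num_layers * node["score"] / total)
        end = num_layers if i == len(order) - 1 else min(start + share, num_layers)
        assignment[node["name"]] = range(start, end)
        start = end
    return assignment

nodes = [
    {"name": "rtx4090", "score": 8.9, "vram_gb": 24},
    {"name": "rtx3060", "score": 8.6, "vram_gb": 12},
    {"name": "m3-max",  "score": 7.0, "vram_gb": 36},
]
print(assign_layers(32, nodes))  # rtx4090 ends up with the middle block
```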
