
SelfHostLLM

Calculate the GPU memory you need for LLM inference

Open Source · Developer Tools · Artificial Intelligence · GitHub
2025-08-08

Product Introduction

  1. SelfHostLLM is a specialized calculator designed to determine GPU memory requirements and maximum concurrent request capacity for self-hosted large language model (LLM) inference deployments. It supports popular open-source models including Llama, Qwen, DeepSeek, Mistral, and their variants across multiple parameter scales and quantization formats. The tool provides infrastructure planning insights by analyzing hardware configurations against model-specific memory demands.
  2. The core value lies in enabling precise resource allocation for AI inference workloads, preventing underutilization or overprovisioning of GPU resources. It translates technical specifications into actionable metrics for deployment planning, particularly valuable for organizations balancing performance requirements with infrastructure costs.

Main Features

  1. The calculator implements a transparent formula-based approach: Available VRAM = (Number of GPUs × VRAM per GPU) − System Overhead − (Base Model Memory × Quantization Factor), with the remainder divided by the per-request KV cache to estimate maximum concurrency. The overhead term accounts for framework-specific memory fragmentation and operating system reservations (see the sketch after this list).
  2. Dynamic adjustment for 6 quantization formats (FP16/BF16 to Extreme Quant) provides realistic memory reduction estimates, including specialized formats like MXFP4 that offer 70% compression while maintaining model integrity.
  3. Context-length-aware KV cache calculation incorporates variable attention mechanisms (MHA, MQA, GQA) through adjustable overhead percentages, with presets for common context windows from 2K to 1M tokens.
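A minimal Python sketch of the calculation in item 1, assuming illustrative quantization factors, a 2 GB system-overhead default, and a fixed per-request KV cache figure; these numbers are assumptions for illustration, not necessarily the tool's exact presets.

```python
# Sketch of the calculator's core formula. QUANT_FACTOR values, the 2 GB
# overhead default, and the 1.5 GB KV cache per request are assumptions
# for illustration, not SelfHostLLM's exact presets.

QUANT_FACTOR = {"fp16": 1.0, "int8": 0.5, "int4": 0.25}  # fraction of FP16 weight memory

def available_vram_gb(num_gpus: int, vram_per_gpu_gb: float,
                      base_model_fp16_gb: float, quant: str,
                      system_overhead_gb: float = 2.0) -> float:
    """VRAM left for KV cache after model weights and framework/OS overhead."""
    total = num_gpus * vram_per_gpu_gb
    weights = base_model_fp16_gb * QUANT_FACTOR[quant]
    return total - system_overhead_gb - weights

def max_concurrent_requests(avail_gb: float, kv_per_request_gb: float) -> int:
    """Worst case: every concurrent request holds a full-context KV cache."""
    return max(int(avail_gb // kv_per_request_gb), 0)

# Example: 2 x 24 GB GPUs, a model with 28 GB of FP16 weights served at INT4,
# and ~1.5 GB of KV cache per full-context request (assumed figure).
avail = available_vram_gb(2, 24, 28, "int4")
print(f"available VRAM: {avail:.1f} GB")                          # 39.0 GB
print(f"max concurrent: {max_concurrent_requests(avail, 1.5)}")   # 26
```

In the actual calculator, the per-request KV cache would itself be derived from context length and attention type, as described in item 3.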

Problems Solved

  1. Eliminates guesswork in GPU cluster sizing by quantifying the relationship between model architecture choices (parameter count, attention type) and hardware capabilities. This prevents costly trial-and-error deployments common in transformer-based model hosting.
  2. Serves DevOps teams and ML engineers responsible for maintaining inference SLAs while optimizing cloud/on-premise GPU costs. Particularly crucial for startups and enterprises deploying chat interfaces, RAG systems, or batch processing pipelines.
  3. Addresses critical deployment scenarios including determining the minimum viable GPU count for 70B+ parameter models, evaluating quantization tradeoffs for specific use cases, and capacity planning for traffic spikes in production environments (a minimal sketch of the GPU-count calculation follows this list).
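A rough sketch of the minimum-GPU-count scenario, assuming simple bytes-per-parameter figures and a flat KV-cache headroom; it is a back-of-the-envelope estimate rather than the tool's exact method, and it ignores the per-GPU tensor-parallel buffers discussed in the FAQ below.

```python
# Hypothetical estimate of minimum GPU count; the byte-per-parameter table,
# overhead, and KV headroom are assumed values.
import math

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}  # GB per billion parameters (assumed)

def min_gpu_count(params_billion: float, quant: str, vram_per_gpu_gb: float,
                  system_overhead_gb: float = 2.0, kv_headroom_gb: float = 10.0) -> int:
    """Smallest GPU count whose combined VRAM fits weights, overhead, and KV headroom."""
    weights_gb = params_billion * BYTES_PER_PARAM[quant]
    needed_gb = weights_gb + system_overhead_gb + kv_headroom_gb
    return math.ceil(needed_gb / vram_per_gpu_gb)

print(min_gpu_count(70, "fp16", 24))  # 140 + 2 + 10 = 152 GB -> 7 GPUs
print(min_gpu_count(70, "int4", 24))  #  35 + 2 + 10 =  47 GB -> 2 GPUs
```

The FP16-versus-INT4 comparison is exactly the quantization tradeoff the calculator is meant to make explicit.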

Unique Advantages

  1. Unlike generic VRAM calculators, it incorporates framework-specific realities like PyTorch's memory fragmentation (default 2GB system overhead) and Tensor Parallelism requirements for multi-GPU configurations.
  2. First open-source tool with built-in presets for 42+ model architectures including Mixtral's conditional MoE activation patterns and DeepSeek-R1's extreme scaling characteristics, verified against actual deployment data.
  3. Combines academic memory estimation models with empirical validation from production systems, offering both theoretical maximums and practical "safe" thresholds for reliable operation.

Frequently Asked Questions (FAQ)

  1. How accurate are the concurrency estimates compared to real-world performance? The calculations represent worst-case scenarios assuming full context window usage across all concurrent requests, so they are conservative floors that real-world deployments, where average context lengths are shorter, typically exceed.
  2. Does the calculator account for variable sequence lengths in dynamic batching? While the base formula assumes fixed context lengths, users can input actual average token counts (rather than the maximum context) for more realistic KV cache estimates that better align with dynamic batching implementations (see the sketch after this FAQ).
  3. What quantization method provides the best balance between memory savings and model quality? MXFP4 quantization generally offers superior performance-to-compression ratios for supported architectures, though INT4 remains the safest cross-platform option with broad framework support.
  4. How does multi-GPU configuration affect total capacity? When using tensor parallelism, available VRAM scales linearly, but an additional 0.5-1 GB per GPU must be subtracted for inter-GPU communication buffers, which the calculator automatically factors into its system overhead estimates.
  5. Can this predict memory needs for fine-tuned model variants? The tool supports custom parameter inputs for modified architectures, but users must manually adjust base memory requirements if layer structures significantly deviate from original model designs.
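A short sketch of the first two answers above: the same available VRAM supports very different concurrency depending on whether each request is assumed to hold the full context window or only the observed average token count. The per-token KV cache size below is an assumed round number, not a value taken from the calculator.

```python
# Worst-case vs. average-context concurrency. The 0.5 MB/token KV cache size
# is an assumption; in practice it varies with model size and attention type
# (MHA, GQA, MQA).
KV_GB_PER_TOKEN = 0.5 / 1024

def kv_cache_per_request_gb(tokens: int) -> float:
    """KV cache footprint of one request holding `tokens` tokens."""
    return tokens * KV_GB_PER_TOKEN

def concurrency(available_vram_gb: float, tokens_per_request: int) -> int:
    return int(available_vram_gb // kv_cache_per_request_gb(tokens_per_request))

avail_gb = 39.0  # VRAM left after weights and overhead (from the earlier sketch)
print("worst case, 32K full context:", concurrency(avail_gb, 32_768))  # 2
print("typical, 4K average context :", concurrency(avail_gb, 4_096))   # 19
```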
