Product Introduction
- Definition: RunInfra is an AI infrastructure automation and optimization platform that automates the deployment, benchmarking, and performance tuning of open-source AI models for production inference.
- Core Value Proposition: RunInfra exists to eliminate the complexity and guesswork from deploying production-ready AI APIs. It automatically finds the most cost-effective and performant hardware and software stack for a given model and workload, taking you from a model name to a measured, deployable endpoint in minutes.
Main Features
- Automated Model & Engine Benchmarking: RunInfra automatically tests a specified open-source model across multiple serving engines (like vLLM, SGLang, TensorRT-LLM) and GPU types (from L4 to B200). It benchmarks key metrics including p95 latency, throughput, VRAM usage, and cost per million tokens to identify the optimal combination.
- Intelligent Optimization & Tuning: The platform automatically applies a suite of production-grade optimizations: weight quantization (e.g., AWQ int4), kernel optimization (FlashAttention v2), runtime tuning (continuous batching, paged KV cache with fp8, speculative decoding), and server configuration autotuning. Which optimizations are applied depends on the user's goal, such as lowest cost or lowest latency.
- Portable Deployment Kit: After optimization, RunInfra provides a complete, runnable deployment package. This includes a Dockerfile with the tuned serving configuration, a serve.sh script, and a runinfra.yaml manifest. This stack can be deployed on RunInfra's managed cloud, exported to platforms like Modal or RunPod, or run on private infrastructure, ensuring no vendor lock-in.
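To make the benchmarking idea concrete, here is a minimal Python sketch of how measured results could be compared against a goal. It is not RunInfra's API: the `BenchmarkResult` type, the `pick_best` function, the goal names, and every number (latencies, throughput, hourly prices) are hypothetical, chosen only to illustrate p95 latency and cost per million tokens.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """One hypothetical benchmark run for an engine/GPU pair."""
    engine: str
    gpu: str
    latencies_ms: list[float]   # per-request latencies from the run
    tokens_per_second: float    # sustained output throughput
    gpu_usd_per_hour: float     # assumed hourly GPU price


def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n avoids floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def cost_per_million_tokens(result: BenchmarkResult) -> float:
    """Dollars spent on GPU time to generate one million tokens."""
    tokens_per_hour = result.tokens_per_second * 3600
    return result.gpu_usd_per_hour / tokens_per_hour * 1_000_000


def pick_best(results: list[BenchmarkResult], goal: str) -> BenchmarkResult:
    """Select the run that best matches the stated goal."""
    if goal == "lowest_cost":
        return min(results, key=cost_per_million_tokens)
    if goal == "lowest_latency":
        return min(results, key=lambda r: p95(r.latencies_ms))
    raise ValueError(f"unknown goal: {goal}")


# Illustrative, made-up measurements.
results = [
    BenchmarkResult("vLLM", "L4", [180, 200, 220, 240, 400], 1200, 0.80),
    BenchmarkResult("SGLang", "H100", [60, 70, 75, 80, 120], 3000, 3.00),
]

for goal in ("lowest_cost", "lowest_latency"):
    best = pick_best(results, goal)
    print(goal, "->", best.engine, "on", best.gpu)
```

With these made-up numbers, the cheaper GPU wins on cost per token while the faster GPU wins on tail latency, which is why the goal has to be stated up front.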
Problems Solved
- Pain Point: Manually benchmarking AI models, selecting the right GPU, applying the correct optimizations, and tuning serving engines for production is extremely complex and demands deep expertise. The process is time-consuming, error-prone, and often leads to over-provisioning or poor performance.
- Target Audience: AI Engineers, ML Ops Engineers, and Developers building with open-source models who need production-grade performance without becoming experts in GPU kernel tuning. It also suits startups and enterprises that want to reduce inference costs and latency while maintaining control over their stack.
- Use Cases: Deploying a cost-optimized chat API using Llama 3.1; building a low-latency document search pipeline with BGE-M3 embeddings; creating a speech-to-text API with Whisper Large V3; setting up a multi-model routing layer to direct queries to the most cost-effective model.
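The multi-model routing use case can be sketched in a few lines of Python. The source does not describe how RunInfra routes queries, so the rule below (pick the cheapest model whose context window fits the prompt) is an assumption for illustration, and the `ModelOption` type, the model names, and the prices are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    """A hypothetical deployed model with an assumed measured cost."""
    name: str
    usd_per_million_tokens: float
    max_context_tokens: int


def route(prompt_tokens: int, options: list[ModelOption]) -> ModelOption:
    """Return the cheapest model whose context window fits the prompt."""
    eligible = [m for m in options if m.max_context_tokens >= prompt_tokens]
    if not eligible:
        raise ValueError("no model can handle a prompt of this length")
    return min(eligible, key=lambda m: m.usd_per_million_tokens)


# Made-up catalogue: names and numbers are placeholders, not real offerings.
OPTIONS = [
    ModelOption("small-model", usd_per_million_tokens=0.20, max_context_tokens=8_000),
    ModelOption("large-model", usd_per_million_tokens=0.90, max_context_tokens=128_000),
]

print(route(2_000, OPTIONS).name)    # short prompt: the cheaper model is eligible
print(route(50_000, OPTIONS).name)   # long prompt: only the larger model fits
```

A real router would likely weigh more than context length, for example task difficulty or latency targets, but the shape stays the same: filter to the models that can serve the request, then minimize cost.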
Unique Advantages
- Differentiation: Unlike generic cloud GPU marketplaces or basic model hosting services, RunInfra performs deep, automated comparative benchmarking and low-level runtime optimization. It doesn't just host your model; it measures candidate stacks and builds the best-performing one for it. Unlike closed-source APIs (e.g., OpenAI), it provides full stack transparency and portability.
- Key Innovation: The "Forge" agent generates and tests custom CUDA kernels for specific model-engine-GPU combinations. This moves beyond configuration into automated compiler-level optimization, which is typically a manual, expert-only task. The platform also produces a verifiable "benchmark receipt", which makes its results transparent.
Frequently Asked Questions (FAQ)
- How does RunInfra optimize model inference cost? RunInfra reduces inference cost in three ways: it automatically tests your model on cheaper GPU tiers (like L4 or L40S), applies aggressive quantization (e.g., AWQ int4) to reduce VRAM needs, and tunes the serving engine parameters (batch size, KV cache) for maximum throughput per dollar. Together these often cut costs by 50-70% versus unoptimized baselines.
- Can I use RunInfra with models not listed on Hugging Face? The primary workflow is designed for models available on Hugging Face. For custom or private models, you need to ensure they are compatible with one of the supported serving engines (vLLM, TensorRT-LLM, etc.), and setup may require more manual configuration.
- What is the difference between RunInfra's managed service and exporting the stack? The managed service is a fully hosted API endpoint billed per million tokens, where RunInfra handles scaling, uptime, and infrastructure. Exporting the stack gives you all the configuration files and Docker images to run the optimized model on your own infrastructure (e.g., your own Kubernetes cluster, RunPod, or Modal account), giving you full control and data sovereignty.
- Does RunInfra support vision or multimodal models? Yes, the platform supports a range of model types beyond text LLMs. This includes vision models (Qwen2-VL, Llama 3.2 Vision), speech models (Whisper, Parler-TTS), and embedding models (NV-Embed, BGE-M3), allowing for the optimization of full multimodal pipelines.
- How does RunInfra ensure my data privacy and security? RunInfra encrypts data in transit end to end, provides isolated GPU infrastructure for workloads, and adheres to a zero data retention policy for customer data processed on its platform. It is also SOC 2 Type II certified, with audited security and privacy controls.
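The cost answer above notes that quantization reduces VRAM needs, which is what makes cheaper GPU tiers viable. A rough Python sketch of that arithmetic follows. It estimates weight memory only as parameters times bits per weight, ignoring quantization metadata (scales and zero points), KV cache, and activations; the 70-billion-parameter size and the 20% headroom are illustrative assumptions, not RunInfra figures.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def weights_fit(num_params: float, bits_per_weight: int, vram_gb: float,
                headroom_fraction: float = 0.2) -> bool:
    """Check whether the weights fit while reserving headroom for KV cache etc.

    The 20% default headroom is an arbitrary illustrative value.
    """
    usable_gb = vram_gb * (1 - headroom_fraction)
    return weight_memory_gb(num_params, bits_per_weight) <= usable_gb


PARAMS = 70e9        # illustrative 70-billion-parameter model
L40S_VRAM_GB = 48    # an L40S has 48 GB of VRAM

print(weight_memory_gb(PARAMS, 16))            # fp16 weights: about 140 GB
print(weight_memory_gb(PARAMS, 4))             # int4 weights: about 35 GB
print(weights_fit(PARAMS, 16, L40S_VRAM_GB))   # fp16 does not fit on one L40S
print(weights_fit(PARAMS, 4, L40S_VRAM_GB))    # int4 fits under these assumptions
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight memory by a factor of four, enough to move a model from a multi-GPU setup to a single mid-tier card. Actual savings depend on the quantization scheme and on how much memory the serving engine reserves for the KV cache.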
