
Forge CLI

Swarm agents optimize CUDA/Triton for any HF/PyTorch model

2026-01-06

Product Introduction

  1. Definition: Forge CLI is a command-line interface (CLI) tool for automated GPU kernel optimization. It belongs to the technical category of AI-driven compiler tools and swarm-based code generation systems.
  2. Core Value Proposition: Forge CLI exists to dramatically accelerate GPU inference for PyTorch and HuggingFace models by generating highly optimized CUDA/Triton kernels. Its primary value is delivering up to 5.16× faster inference than torch.compile(mode='max-autotune') with a 97.6% kernel correctness rate, using a scalable swarm of AI agents.
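
For context, the torch.compile baseline those speedups are measured against can be reproduced with standard PyTorch and transformers APIs. A minimal timing sketch, assuming a CUDA device and the transformers library; the model ID is illustrative (gated models require HuggingFace access) and the timing loop is not part of Forge itself:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; any HuggingFace model ID works here
model_id = "meta-llama/Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda().eval()
baseline = torch.compile(model, mode="max-autotune")

inputs = tok("The quick brown fox", return_tensors="pt").to("cuda")
with torch.inference_mode():
    for _ in range(3):              # warm-up: triggers compilation and autotuning
        baseline(**inputs)
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        baseline(**inputs)
    torch.cuda.synchronize()
print(f"baseline mean latency: {(time.perf_counter() - t0) / 20 * 1e3:.2f} ms")
```

The warm-up iterations matter: mode='max-autotune' front-loads kernel autotuning, so the first calls run far slower than steady state.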

Main Features

  1. Swarm Agent Optimization: 32 parallel Coder+Judge agent pairs compete to generate the fastest valid kernels. Coders create kernels using Retrieval-Augmented Generation (RAG) from a database of 1,711 CUTLASS and 113 Triton templates. Judges validate correctness before compilation. This massively parallel search enables deep exploration of optimization strategies like tensor core utilization (WGMMA/TMA), memory coalescing, and kernel fusion.
  2. Evolutionary Search Architecture: Combines MAP-Elites (quality-diversity search across 36 behavioral cells) with Island Models (4 specialized populations with migration). Mutations are guided by an LLM (NVIDIA Nemotron 3 Nano 30B), enabling efficient traversal of the optimization space and coverage of the key GPU bottleneck classes: memory-bound ops, compute-bound ops, fused ops, and tensor core optimizations. A minimal sketch of the MAP-Elites archive mechanics appears after this list.
  3. Inference-Time Scaling: Powered by a fine-tuned NVIDIA Nemotron 3 Nano 30B model running at 250k tokens/sec, enabling rapid kernel exploration. This lets Forge optimize all layers of any HuggingFace model ID in minutes instead of hours. Outputs are native CUDA kernels (compiled with nvcc) or Triton kernels (JIT-compiled from its Python DSL), both serving as drop-in PyTorch replacements.
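
Forge's actual cell definitions, kernel representation, and fitness measurement are not public. The toy sketch below shows only the MAP-Elites archive mechanics (the island-model layer is omitted), with a random perturbation standing in for LLM-guided mutation and a synthetic score standing in for measured speedup:

```python
import random

N_CELLS = 36  # one elite kernel kept per behavioral cell

def behavior_cell(kernel):
    # Toy descriptor: bucket by mean parameter value; Forge's real descriptors
    # (memory- vs compute-bound, fusion depth, ...) are not public
    d = sum(kernel) / len(kernel)
    return max(0, min(N_CELLS - 1, int(d * N_CELLS / 6.0)))

def fitness(kernel):
    # Synthetic stand-in for measured kernel speedup (higher is better)
    return -sum((g - 3.0) ** 2 for g in kernel)

def mutate(kernel):
    # Forge mutates kernels with an LLM; a random perturbation stands in here
    child = list(kernel)
    child[random.randrange(len(child))] += random.gauss(0.0, 0.5)
    return child

seed = [random.uniform(0.0, 6.0) for _ in range(4)]
archive = {behavior_cell(seed): (fitness(seed), seed)}  # cell -> best (fitness, kernel)

for _ in range(5000):
    parent = random.choice(list(archive.values()))[1]
    child = mutate(parent)
    cell, f = behavior_cell(child), fitness(child)
    if cell not in archive or f > archive[cell][0]:
        archive[cell] = (f, child)  # quality-diversity: keep the best per cell

print(f"{len(archive)}/{N_CELLS} cells filled; best fitness {max(f for f, _ in archive.values()):.3f}")
```

The key property this illustrates: instead of converging on a single champion, the archive retains the best kernel found in every behavioral niche, so distinct bottleneck strategies keep evolving in parallel.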

Problems Solved

  1. Pain Point: Manual GPU kernel optimization is time-intensive and requires expert-level CUDA/Triton knowledge. Existing auto-tuners like torch.compile often fail to achieve peak hardware performance, especially on complex architectures like Llama or SDXL.
  2. Target Audience: Machine Learning Engineers deploying PyTorch/HuggingFace models, GPU Kernel Developers, MLOps teams optimizing inference latency, and researchers needing hardware-efficient model implementations.
  3. Use Cases:
    • Optimizing transformer layers (GQA, RoPE, SwiGLU) in LLMs like Llama-3.1-8B (5.16× speedup); a hand-written fused kernel in this style is sketched after this list.
    • Accelerating diffusion model components (e.g., SDXL UNet cross-attention + conv fusion, 2.87× speedup).
    • Real-time inference for speech models (Whisper-large-v3 encoder-decoder, 2.63× speedup).
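
Kernel fusion of the kind listed above is naturally expressed in Triton. The following is a minimal hand-written sketch of a fused SwiGLU activation (SiLU(gate) * up in a single pass over memory), for illustration only; it is not Forge output:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def swiglu_kernel(gate_ptr, up_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    gate = tl.load(gate_ptr + offsets, mask=mask)
    up = tl.load(up_ptr + offsets, mask=mask)
    # SiLU(gate) * up computed in registers: one read per input, one write out
    tl.store(out_ptr + offsets, gate * tl.sigmoid(gate) * up, mask=mask)

def swiglu(gate: torch.Tensor, up: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(gate)
    n = out.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    swiglu_kernel[grid](gate, up, out, n, BLOCK_SIZE=1024)
    return out

# Quick correctness check against the PyTorch reference
g = torch.randn(1 << 20, device="cuda")
u = torch.randn(1 << 20, device="cuda")
assert torch.allclose(swiglu(g, u), torch.nn.functional.silu(g) * u, atol=1e-5)
```

An unfused implementation materializes silu(gate) in global memory before the multiply; the fused kernel reads each input once and writes the result once, which is the memory-bound pattern described above.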

Unique Advantages

  1. Differentiation: Outperforms torch.compile(mode='max-autotune') by 2.18× on average (up to 5.16×) with near-perfect (97.6%) correctness. Unlike traditional auto-tuners, Forge uses semantic embeddings (1536-dim) and LLM-guided mutations to explore novel optimization strategies; a toy sketch of that embedding-based retrieval follows this list.
  2. Key Innovation: The Coder+Judge swarm architecture with 32 parallel agents enables unprecedented search breadth. Combined with inference-time scaling via Nemotron 3 Nano 30B, it achieves 5–100× faster optimization cycles than conventional methods while supporting datacenter GPUs (H100, B200, H200).
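
The embedding-based retrieval pairs with the template database described under Main Features (1,711 CUTLASS and 113 Triton templates). A toy nearest-neighbor sketch, with random vectors standing in for real template embeddings since Forge's embedding model is not public:

```python
import numpy as np

DIM = 1536                                   # embedding dimensionality from above
rng = np.random.default_rng(0)
templates = rng.standard_normal((1824, DIM)) # stand-in: 1711 CUTLASS + 113 Triton
templates /= np.linalg.norm(templates, axis=1, keepdims=True)

def top_k_templates(query_vec, k=5):
    """Return indices of the k templates closest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = templates @ q                   # cosine similarity (rows are unit-norm)
    return np.argsort(scores)[-k:][::-1]

print(top_k_templates(rng.standard_normal(DIM)))
```

In a real RAG pipeline, the retrieved templates would be injected into the Coder agent's prompt as starting points for kernel generation.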

Frequently Asked Questions (FAQ)

  1. How does Forge CLI guarantee kernel correctness?
    Forge uses Judge agents to validate kernel logic before compilation and runs post-compile correctness checks, achieving a 97.6% correctness rate across benchmarks. A minimal version of such a post-compile check is sketched at the end of this FAQ.
  2. What models and hardware does Forge CLI support?
    It supports any PyTorch nn.Module, HuggingFace Model ID, or KernelBench task. Optimized for NVIDIA datacenter GPUs (H100, B200, A100) with CUDA/Triton compatibility.
  3. Is Forge CLI faster than NVIDIA’s cuDNN or Triton?
    Yes. By combining swarm-generated kernel fusion with tensor core optimizations, Forge outperforms hand-written cuDNN/Triton implementations and auto-tuned PyTorch, as demonstrated on Llama-3.1-8B (5.16×) and Phi-3-mini (2.75×) benchmarks.
  4. What happens if Forge doesn’t beat torch.compile?
    RightNow AI offers a full refund if generated kernels fail to outperform torch.compile(mode='max-autotune') for your model, per their public guarantee.
  5. How are credits consumed in Forge’s pricing model?
    Credits are used per optimization task (e.g., one HuggingFace model ID). Bulk discounts apply (e.g., 100 credits for $112.50), with unused credits retained for future jobs.
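
As referenced in the first answer, a post-compile correctness check can be pictured as comparing a candidate kernel against its reference PyTorch implementation on random inputs. A minimal sketch, assuming a CUDA device; the tolerances, trial count, and the check_kernel helper are illustrative, not Forge's actual harness:

```python
import torch

def check_kernel(candidate, reference, make_inputs, trials=10, rtol=1e-3, atol=1e-3):
    """Accept a candidate kernel only if it matches the reference on random inputs."""
    for _ in range(trials):
        args = make_inputs()
        with torch.inference_mode():
            expected = reference(*args)
            actual = candidate(*args)
        if not torch.allclose(actual, expected, rtol=rtol, atol=atol):
            return False  # reject: numerical mismatch beyond tolerance
    return True

# Example: validate a (hypothetical) generated softmax kernel against torch.softmax
reference = lambda x: torch.softmax(x, dim=-1)
candidate = reference  # stand-in; a real run would pass the generated kernel
ok = check_kernel(candidate, reference, lambda: (torch.randn(64, 128, device="cuda"),))
print("kernel accepted" if ok else "kernel rejected")
```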
