
nanochat

The best ChatGPT that $100 can buy

2025-10-14

Product Introduction

  1. nanochat is a full-stack implementation of a ChatGPT-like large language model (LLM) contained in a single, clean, and minimal codebase. It covers every stage of LLM development: tokenization, pretraining, finetuning, evaluation, inference, and a web-based user interface. The system is designed to run end-to-end on a single 8XH100 GPU node, prioritizing simplicity and accessibility for developers and researchers.
  2. The core value of nanochat lies in democratizing LLM development: it offers a dependency-lite, hackable codebase that reduces operational complexity and cost. It enables users to train, finetune, and deploy conversational AI models with minimal infrastructure overhead, making advanced LLM capabilities accessible at a fraction of traditional costs.

Main Features

  1. nanochat provides an end-to-end pipeline for LLM development: tokenization with a custom Rust-based BPE tokenizer, pretraining on raw text data, supervised finetuning (SFT), and evaluation against benchmarks such as MMLU, GSM8K, and HumanEval. The pipeline is fully automated through scripts like speedrun.sh, which trains a baseline model in ~4 hours on 8XH100 GPUs (see the command sketch after this list).
  2. The system supports scalable model architectures via adjustable hyperparameters such as --depth (e.g., d26 for GPT-2-level performance) and --device_batch_size, enabling customization for hardware constraints. Users can train models ranging from a $100-tier, 4e19-FLOPs "kindergartener" model to larger $300–$1,000 tiers with improved performance.
  3. A lightweight web UI (scripts.chat_web) replicates ChatGPT's interface, allowing real-time interaction with trained models. The UI serves inference endpoints on a configurable port, and the pipeline writes a report.md file with evaluation summaries, including CORE scores and task-specific metrics.
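For orientation, the sketch below shows what driving the pipeline from a shell might look like, built only from the scripts and flags named above (speedrun.sh, scripts.base_train, scripts.chat_web, --depth, --device_batch_size). Exact flag spellings can differ between versions, so treat it as illustrative rather than canonical.

```bash
# End-to-end baseline run: tokenizer, pretraining, finetuning,
# evaluation, and report generation in one script (~4 hours on 8XH100).
bash speedrun.sh

# Alternatively, launch pretraining directly with custom hyperparameters.
# --depth sets model size (d26 approximates GPT-2-level performance);
# --device_batch_size trades GPU memory against accumulation steps.
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=26 --device_batch_size=16

# Serve the trained model behind the ChatGPT-style web UI
# (the port is configurable; consult the script's options).
python -m scripts.chat_web
```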

Problems Solved

  1. nanochat addresses the high cost and complexity of LLM development by consolidating the entire training-to-deployment workflow into a single codebase with minimal dependencies. It eliminates the need for fragmented toolsets and multi-node orchestration, reducing setup time and operational friction.
  2. The product targets AI researchers, hobbyists, and educators seeking to experiment with LLMs without requiring enterprise-scale budgets or infrastructure. It is particularly useful for prototyping custom models, benchmarking performance, and teaching LLM concepts in academic settings.
  3. Typical use cases include training cost-effective conversational agents, finetuning domain-specific models (e.g., for healthcare or finance), and conducting reproducible experiments on model architectures or training methodologies.

Unique Advantages

  1. Unlike heavyweight frameworks such as Hugging Face Transformers, or hosted services such as OpenAI's API, nanochat prioritizes transparency and simplicity by avoiding opaque abstractions and excessive dependencies. The codebase is intentionally minimal (45 files, ~8K lines) to facilitate deep customization and auditing.
  2. The speedrun.sh script is a standout feature, automating the full training lifecycle while logging metrics like wall-clock time and token throughput (a typical long-running invocation is sketched after this list). This enables rapid iteration and benchmarking, with preconfigured hyperparameters for different budget tiers.
  3. Competitive advantages include GPU memory control (via adjustable batch sizes and gradient accumulation), an efficient Rust-based tokenizer, and the ability to scale down to smaller GPU configurations. The system also supports multi-GPU training via PyTorch's torchrun while maintaining single-node simplicity.
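Because a full run takes hours, a common pattern is to launch speedrun.sh inside a detachable screen session and follow its metric log from another shell. This is a sketch of that workflow, not something the tooling mandates; the log file name is an arbitrary choice.

```bash
# Run the speedrun in a detachable screen session, logging all output
# (wall-clock time, token throughput, eval results) to a file.
screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh

# From another shell, follow training progress.
tail -f speedrun.log
```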

Frequently Asked Questions (FAQ)

  1. What hardware is required to run nanochat? nanochat is optimized for a single node of eight H100 GPUs (80GB VRAM each), but it can scale down to single-GPU setups by reducing --device_batch_size. For GPUs with less memory, lower the batch size (e.g., 32 → 16) or rely on gradient accumulation to avoid out-of-memory errors (see the sketch after this list).
  2. Can I modify the model architecture for specific tasks? Yes, the codebase allows direct modification of hyperparameters like depth and attention heads. For example, setting --depth=26 approximates GPT-2 performance, while adjusting the training loop in base_train.py enables custom training strategies.
  3. How does the $100-tier model perform compared to ChatGPT? The baseline model (4e19 FLOPs) achieves limited capabilities (e.g., 7.58% GSM8K accuracy) and is designed for prototyping. Larger tiers (e.g., $300 for ~12 hours of training) improve metrics like CORE scores but remain research-focused rather than production-grade.
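As a concrete illustration of the scaling-down advice above (a sketch only, assuming scripts.base_train accepts the flags named in this document): halving --device_batch_size roughly halves per-GPU activation memory, and pairing it with gradient accumulation keeps the effective batch size unchanged at the cost of longer wall-clock time.

```bash
# Single-GPU fallback with a reduced per-device batch size (32 -> 16)
# to fit smaller VRAM; gradient accumulation preserves the effective
# batch size while training takes proportionally longer.
torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- \
    --device_batch_size=16
```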
