Product Introduction
- Definition: Mellum is a family of open-source large language models (LLMs) developed by JetBrains and engineered for ultra-low-latency, high-throughput inference. Technically, it is a compact, production-ready Mixture-of-Experts (MoE) model optimized for real-world coding and development workflows.
- Core Value Proposition: Mellum exists to provide fast, cost-efficient AI inference without compromising on quality for common development tasks. It delivers the low latency and high throughput required for real-time systems, making it an ideal backbone for AI-powered developer tools and high-volume applications.
Main Features
- Mixture-of-Experts (MoE) Architecture: Mellum's speed comes from its MoE design, which activates only a small subset of the model's 12 billion total parameters at each step (MoE routers typically make this choice per token rather than per request). By routing work to specialized expert sub-networks, it achieves ultra-low-latency inference and high throughput, often running twice as fast as similarly sized dense models. This design reduces the compute and memory traffic needed per token, although the full set of expert weights still has to be loaded, and it enables efficient deployment.
- Dual-Task Training for Code and Language: Initially focused on code, Mellum has been expanded and trained from scratch on over 10 trillion tokens of data, including licensed code, web text, and mathematics. The pre-training occurs in three progressive stages, culminating in a dataset where code constitutes over 50% of the tokens. This is followed by large-scale supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR), resulting in a model proficient in both programming tasks and natural language understanding.
- Optimized Deployment and Cost Efficiency: Mellum is engineered for production deployment with a focus on low operational cost. Its compact per-request KV-cache footprint allows a single GPU (such as an H100, H200, or the newer B200/B300) to serve a large number of concurrent users without exhausting memory. Together with the low compute cost per token, this halves inference costs compared to larger models while maintaining strong quality for its target use cases. It is released under the Apache 2.0 license, with weights available on Hugging Face.
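The sparse activation described under the MoE feature can be illustrated with a minimal top-k routing sketch. This is a generic illustration of the technique, not Mellum's actual router: the number of experts, the value of k, the scalar "experts", and the function names are all assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and normalize their weights.

    Returns a list of (expert_index, weight) pairs. Ties keep the
    lower index first because Python's sort is stable.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs.

    Experts that are not selected are never called, which is where
    the compute saving of a sparse MoE layer comes from.
    """
    output = 0.0
    for index, weight in route_top_k(router_logits, k):
        output += weight * experts[index](x)
    return output

# Hypothetical toy setup: 4 scalar "experts", 2 active per token.
toy_experts = [lambda x, scale=i + 1: x * scale for i in range(4)]
toy_logits = [0.1, 2.0, -1.0, 2.0]
```

With these toy values, only experts 1 and 3 run, so half of the expert parameters stay idle for this token. A real MoE layer applies the same idea to vectors and learned feed-forward blocks.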
Problems Solved
- Pain Point: The high latency and prohibitive operational cost of using large, general-purpose LLMs (such as GPT models) for real-time, high-volume AI workflows in software development.
- Target Audience: AI/ML engineers and researchers building intelligent systems; software developers using AI-assisted coding tools; DevOps engineers architecting AI infrastructure; and organizations seeking private, sovereign AI solutions for sensitive codebases.
- Use Cases: smart routing and orchestration of AI workloads to the optimal model; low-latency Retrieval-Augmented Generation (RAG) pipelines for fast code search and documentation Q&A; powering fast sub-agents in complex multi-step AI agent workflows; enabling private, local code completion and chat within IDEs; and serving as a high-speed layer for pre- and post-processing tasks alongside larger frontier models.
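The smart-routing use case can be sketched as a small dispatch rule that sends latency-sensitive or routine work to a fast model tier and everything else to a larger model. This is only an illustrative sketch: the task names, the tier labels, and the 500 ms threshold are hypothetical and do not come from Mellum or any JetBrains product.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    task: str               # hypothetical task label, e.g. "completion"
    latency_budget_ms: int  # how long the caller is willing to wait

# Hypothetical set of routine tasks a compact model is assumed to handle.
FAST_TASKS = frozenset({"completion", "rag_answer", "classification"})

def choose_tier(request: Request, fast_threshold_ms: int = 500) -> str:
    """Return "fast" for routine or latency-critical work, else "frontier"."""
    if request.task in FAST_TASKS:
        return "fast"
    if request.latency_budget_ms <= fast_threshold_ms:
        return "fast"
    return "frontier"
```

For example, `choose_tier(Request("completion", 2000))` returns `"fast"`, while `choose_tier(Request("architecture_review", 30000))` returns `"frontier"`. A production router would usually also consider prompt length and confidence signals.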
Unique Advantages
- Differentiation: Unlike most large models, which prioritize raw capability at the expense of speed and cost, Mellum focuses on inference efficiency. It is not designed to replace the largest models for the most complex reasoning tasks but to outperform them in scenarios that demand millisecond-scale response times, high request volumes, and predictable, lower costs. It is competitive with other small open-source LLMs and stands out in deployment efficiency.
- Key Innovation: The key innovation is the application of a highly optimized Mixture-of-Experts architecture to a compact model class that is typically served by dense models. Combined with deliberate training for a minimal KV-cache footprint, this allows Mellum to deliver near-frontier quality on specific tasks while remaining efficient to deploy and scale on a single high-end GPU, a significant advantage for both cloud and edge deployment.
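Why a small KV-cache footprint matters for single-GPU serving can be shown with a back-of-the-envelope estimate. The formula below is the standard one for attention with separate key and value caches. The layer count, head counts, context length, and free-memory figure are hypothetical placeholders, not Mellum's published configuration.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache stored per token.

    The leading factor of 2 accounts for keeping both a key and a value
    vector per layer and per KV head; bytes_per_value=2 assumes 16-bit
    cache entries.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def max_concurrent_requests(free_bytes, context_tokens, per_token_bytes):
    """How many full-length requests fit in the memory left after weights."""
    return free_bytes // (context_tokens * per_token_bytes)

GIB = 1024 ** 3

# Hypothetical configuration A: 32 layers, 8 KV heads of size 128.
wide = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
# Hypothetical configuration B: same model shape but only 2 KV heads.
narrow = kv_bytes_per_token(num_layers=32, num_kv_heads=2, head_dim=128)

print(wide)    # 131072 bytes (128 KiB) per token
print(narrow)  # 32768 bytes (32 KiB) per token
print(max_concurrent_requests(40 * GIB, 8192, wide))    # 40
print(max_concurrent_requests(40 * GIB, 8192, narrow))  # 160
```

Under these assumed numbers, cutting the per-token cache by a factor of four lets the same 40 GiB of free GPU memory hold four times as many 8192-token requests.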
Frequently Asked Questions (FAQ)
- How does Mellum compare to using a larger model like GPT for code completion? Mellum is designed for scenarios where ultra-low latency and cost efficiency are critical. It excels at providing fast responses (milliseconds vs. seconds) for high-volume tasks like real-time code suggestions, intelligent routing, and sub-agent workflows, making it ideal for production AI systems where GPT would be too slow or expensive at scale.
- Is Mellum open-source, and how can I deploy it? Yes, Mellum is fully open-source under the Apache 2.0 license. You can run it locally with Ollama, use it as a custom model in JetBrains AI Assistant, or deploy it via supported inference platforms. Its efficient design is optimized for deployment on H100, H200, and newer NVIDIA GPUs, even as a single-GPU instance.
- What programming languages does Mellum support? Mellum1 supports a wide range of languages, including Java, Kotlin, Python, Go, PHP, C, C++, C#, JavaScript, TypeScript, CSS, HTML, Rust, and Ruby. The latest Mellum2 model expands this to all major programming languages within a general chat interface, in addition to its strong natural language capabilities.
- What kind of data was Mellum trained on, and is it safe to use? Mellum was trained on a large corpus of permissively licensed code, web text, and math data. Its training process includes alignment steps for consistency, and it can be fine-tuned and deployed locally or in a private cloud, giving organizations full control over their data and privacy for sovereign AI use cases.
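For the local deployment path mentioned in the FAQ, a request to a locally running Ollama server might look like the sketch below. It assumes Ollama's default local endpoint (`http://localhost:11434/api/generate`) and that a Mellum model has already been pulled. The model tag is deliberately left as a parameter because the exact tag is not given here, and the helper names are made up for this example.

```python
import json
import urllib.request

# Ollama's default local address; change it if the server runs elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model_tag: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model_tag, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(model_tag: str, prompt: str, timeout_s: float = 30.0) -> str:
    """Send the request and return the generated text.

    Requires a running Ollama server with the given model available.
    """
    request = build_generate_request(model_tag, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

Calling `generate(tag, "Write a Kotlin function that reverses a string.")` with the tag of the locally installed model returns the completion as a string. Because everything runs on the local machine, prompts and code never leave it.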
