Ollama v0.19

Massive local model speedup on Apple Silicon with MLX

2026-04-01

Product Introduction

  1. Definition: Ollama v0.19 is a major update to the local large language model (LLM) inference engine, specifically redesigned to optimize performance on macOS through Apple’s MLX machine learning framework. It serves as a localized backend for running sophisticated AI models, now featuring native support for NVIDIA’s NVFP4 quantization and advanced context management.

  2. Core Value Proposition: Ollama v0.19 exists to provide the fastest possible local inference experience on Apple Silicon hardware. By shifting the underlying architecture to MLX, it maximizes the potential of Apple’s unified memory and neural accelerators, significantly reducing "Time to First Token" (TTFT) and increasing generation speeds for complex agentic workflows and coding tasks.

Main Features

  1. MLX Framework Integration for Apple Silicon: Ollama v0.19 replaces previous inference methods on macOS with Apple’s MLX framework. This integration utilizes the unified memory architecture of M-series chips (M1 through M5) to eliminate data transfer bottlenecks. On the latest M5, M5 Pro, and M5 Max chips, the software leverages dedicated GPU Neural Accelerators, delivering a 56% increase in prefill performance (up to 1810 tokens/s) and nearly doubling decode speed (from 58 to 112 tokens/s) compared to version 0.18.
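The percentage gains above can be reproduced directly from the raw token rates quoted in this post:

```python
# Sanity-check the claimed M5 speedups from the v0.18 -> v0.19 benchmarks.
# All figures are the ones quoted in this post; tok/s = tokens per second.
v018_prefill, v019_prefill = 1154, 1810   # tok/s, prompt processing
v018_decode,  v019_decode  = 58, 112      # tok/s, generation

prefill_gain = (v019_prefill / v018_prefill - 1) * 100
decode_gain  = (v019_decode  / v018_decode  - 1) * 100

print(f"prefill: +{prefill_gain:.1f}%")   # ~ +56.8%
print(f"decode:  +{decode_gain:.1f}%")    # ~ +93.1%
```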

  2. NVFP4 Quantization Support: This version introduces support for NVIDIA’s NVFP4 format. By utilizing 4-bit floating-point quantization, Ollama v0.19 maintains high model accuracy while drastically reducing the memory bandwidth and storage footprint required for inference. This allows local users to achieve production-grade parity with cloud-based inference providers and run models optimized via NVIDIA’s model optimizer directly on Mac hardware.
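To make the idea concrete, here is a toy sketch of NVFP4-style block quantization. This is illustrative only, not Ollama’s actual kernel: real NVFP4 stores one FP8 (E4M3) scale per 16-value block and packs the 4-bit codes; the sketch below keeps the scale as a plain float and only snaps each value to the nearest representable E2M1 magnitude.

```python
# Toy sketch of NVFP4-style quantization (not Ollama's implementation):
# each block of 16 values shares one scale, and each value is rounded to
# the nearest 4-bit E2M1 float. E2M1 represents these magnitudes (plus sign):
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values, block=16):
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        # Scale so the largest magnitude in the block maps onto 6.0,
        # the largest representable E2M1 value.
        scale = max(abs(v) for v in chunk) / 6.0 or 1.0
        for v in chunk:
            mag = min(E2M1, key=lambda m: abs(abs(v) / scale - m))
            out.append((mag if v >= 0 else -mag) * scale)
    return out

weights = [0.12, -0.53, 0.9, 0.04, -1.2, 0.33, 0.0, 0.75]
print(quantize_block(weights))
```

The per-block scale is why 4-bit floating point holds up so well: outliers only distort the handful of values that share their block, rather than the whole tensor.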

  3. Advanced Context Caching and Intelligent Checkpoints: The caching engine has been overhauled to support efficient agentic workflows. It introduces "Intelligent Checkpoints," which store snapshots of the cache at strategic locations within a prompt to minimize redundant processing. "Smarter Eviction" policies ensure that shared system prefixes (like developer instructions or tool definitions) remain in memory even when older conversation branches are discarded, leading to more responsive multi-turn sessions.
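The eviction behavior described above can be sketched as a small prefix cache. This is an illustrative model, not Ollama’s internal data structure: shared prefixes (such as the system prompt) are pinned so that LRU eviction only discards conversation branches.

```python
from collections import OrderedDict

# Toy model of "smarter eviction": pinned entries (shared system prefixes)
# survive eviction; only unpinned branch caches are dropped under pressure.
class PrefixCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # prefix -> (kv_state, pinned)

    def put(self, prefix, kv_state, pinned=False):
        self.entries[prefix] = (kv_state, pinned)
        self.entries.move_to_end(prefix)
        # Evict the least-recently-used *unpinned* entry when over capacity.
        while len(self.entries) > self.capacity:
            for key, (_, is_pinned) in self.entries.items():
                if not is_pinned:
                    del self.entries[key]
                    break
            else:
                break   # everything left is pinned; nothing evictable

    def get(self, prefix):
        if prefix in self.entries:
            self.entries.move_to_end(prefix)
            return self.entries[prefix][0]
        return None

cache = PrefixCache(capacity=2)
cache.put("system: you are a coding agent", "kv0", pinned=True)
cache.put("branch A", "kv1")
cache.put("branch B", "kv2")   # evicts branch A, keeps the pinned prefix
print(cache.get("system: you are a coding agent"))   # -> kv0
```

In the real engine the cached state is the model’s KV cache for a token prefix, so keeping the shared prefix resident means each new branch only pays for its own suffix.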

Problems Solved

  1. High Latency in Local LLMs: Version 0.19 addresses the "wait time" associated with large models (35B+ parameters). By optimizing prefill speeds, it sharply reduces the lag commonly felt when starting a new conversation or feeding large codebases into an agent.

  2. Memory Inefficiency in Agentic Workflows: Standard inference engines often reprocess the entire prompt history for every turn. Ollama v0.19 reuses cache across different conversations and branches, which is essential for tools like Claude Code that frequently branch out to perform different sub-tasks using the same system prompt.

  3. Target Audience: The update is specifically designed for Software Engineers, AI Researchers, and Power Users who require private, high-performance local AI. It targets "Agentic Developers" using tools like OpenClaw and Codex, as well as Mac users with high-spec hardware (32GB+ RAM) who need to run large-scale models like Qwen3.5-35B locally.

  4. Use Cases: Key scenarios include local coding agents (Claude Code, Pi) that require rapid code generation, personal assistants that handle long-context documents, and offline development environments where data privacy and low-latency interaction are non-negotiable.

Unique Advantages

  1. Differentiation: Unlike traditional GGUF-based local runners that rely on generalized, cross-platform kernels, Ollama v0.19 is specifically "silicon-aware." By targeting the MLX framework, it outperforms generic runtimes on macOS, effectively turning a Mac into a high-throughput AI workstation that rivals dedicated Linux servers with discrete GPUs.

  2. Key Innovation: The standout innovation is the "Production-to-Local" parity achieved through NVFP4 and MLX. This allows a developer to experiment with the exact same quantized weights used in high-scale production environments, ensuring that the behavior and accuracy of the model on their laptop closely match the final deployment environment.

Frequently Asked Questions (FAQ)

  1. How much faster is Ollama v0.19 compared to version 0.18 on Apple Silicon? Ollama v0.19 offers significant speedups on M5 chips, with prefill performance jumping from 1154 tokens/s to 1810 tokens/s, and decode speeds nearly doubling from 58 tokens/s to 112 tokens/s. When running int4 quantization, decode speeds can reach up to 134 tokens/s.

  2. What are the hardware requirements for running the new Qwen3.5-35B model? To run the Qwen3.5-35B-A3B model using the NVFP4 quantization in this preview release, users are required to have a Mac with Apple Silicon and at least 32GB of unified memory.
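A back-of-the-envelope check shows why 32GB is the floor. Assuming dense 4-bit storage and ignoring block-scale overhead, the weights alone consume roughly half a byte per parameter; note that although A3B means only about 3B parameters are active per token, all experts must remain resident in memory:

```python
# Rough memory estimate for a 35B-parameter model at 4-bit quantization
# (assumption: dense 4-bit storage, block-scale overhead ignored).
params = 35e9            # total parameters; MoE experts all stay resident
bytes_per_param = 0.5    # 4 bits = half a byte
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.1f} GB")   # ~17.5 GB
```

That leaves headroom within 32GB for the KV cache, quantization scales, the application, and the operating system, all of which share the same unified memory pool.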

  3. What is NVFP4 and why is it important for local LLM inference? NVFP4 is a 4-bit floating-point format developed by NVIDIA. It is critical for local inference because it allows users to run large, high-parameter models with lower memory and storage requirements without sacrificing the quality or accuracy of the responses, aligning local performance with industry-standard production environments.

  4. Does Ollama v0.19 improve battery life or resource usage on Mac? Yes. Through smarter cache reuse and more efficient eviction policies, the system performs less redundant computation. By leveraging the MLX framework and dedicated GPU Neural Accelerators, the software completes inference tasks faster, allowing the hardware to return to idle states sooner and reducing overall memory utilization across multiple AI-driven conversations.
