
Ollama v0.7

Run leading vision models locally with the new engine

2025-05-19

Product Introduction

  1. Ollama v0.7 is a local inference engine designed to run multimodal AI models, including vision-capable large language models (LLMs), with enhanced reliability and accuracy.
  2. Ollama v0.7's core value is processing multimodal inputs (text and images today, with other modalities planned) efficiently on local hardware, optimizing memory use while preserving each model's intended architecture.
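
To make this concrete, the sketch below sends one image and a question to a locally running Ollama server over its REST API (POST /api/generate on the default port 11434) and prints the reply. It is a minimal example rather than the only way to call Ollama; the model name "gemma3" and the image path are placeholders for any vision model you have pulled locally and any file on disk.

    package main

    import (
        "bytes"
        "encoding/base64"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    // generateRequest mirrors the JSON body of Ollama's /api/generate endpoint.
    type generateRequest struct {
        Model  string   `json:"model"`
        Prompt string   `json:"prompt"`
        Images []string `json:"images,omitempty"` // base64-encoded image data
        Stream bool     `json:"stream"`
    }

    type generateResponse struct {
        Response string `json:"response"`
    }

    func main() {
        // Read and base64-encode a local image (path is illustrative).
        raw, err := os.ReadFile("photo.jpg")
        if err != nil {
            panic(err)
        }

        body, _ := json.Marshal(generateRequest{
            Model:  "gemma3", // assumes this vision model has been pulled locally
            Prompt: "What is in this picture?",
            Images: []string{base64.StdEncoding.EncodeToString(raw)},
            Stream: false,
        })

        // Ollama listens on localhost:11434 by default.
        resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out generateResponse
        if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
            panic(err)
        }
        fmt.Println(out.Response)
    }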

Main Features

  1. Ollama v0.7 supports advanced vision models such as Meta Llama 4 Scout (109B MoE), Google Gemma 3, Qwen 2.5 VL, and Mistral Small 3.1, enabling tasks like visual question answering, multi-image analysis, and document scanning.
  2. The engine implements model-specific memory optimizations, including image caching for faster follow-up prompts and KV cache tuning for hardware-aware memory allocation, allowing longer context processing on constrained systems (a request-level sketch follows this list).
  3. Modular model architecture isolates each model's execution logic, enabling independent implementation of vision encoders, text decoders, and projection layers without cross-model interference or system-wide code patches.
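
To make the modular design in item 3 concrete, here is a hypothetical Go sketch of that per-model boundary. These interfaces are illustrative only and are not Ollama's actual internal types; the point is that each model supplies its own vision encoder, projection layer, and text decoder behind a single interface, so a change to one model's attention scheme never touches another model's code path.

    package model

    // Hypothetical interfaces sketching the isolation described above; they are
    // NOT Ollama's internal API, only an illustration of the idea that each
    // model family ships its own encoder, projector, and decoder.

    // VisionEncoder turns raw image bytes into patch embeddings.
    type VisionEncoder interface {
        Encode(image []byte) ([][]float32, error)
    }

    // Projector maps vision embeddings into the text model's embedding space.
    type Projector interface {
        Project(patches [][]float32) ([][]float32, error)
    }

    // TextDecoder consumes token and image embeddings and produces logits.
    type TextDecoder interface {
        Forward(tokens []int32, imageEmbeds [][]float32) ([]float32, error)
    }

    // MultimodalModel bundles the three stages for one model family. Because
    // each model implements this boundary independently, Gemma 3's sliding
    // window attention, for example, never leaks into another model's code.
    type MultimodalModel interface {
        VisionEncoder
        Projector
        TextDecoder
    }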
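
The memory optimizations in item 2 are engine-internal, but the longer-context benefit shows up at the request level. A minimal sketch, assuming a locally pulled "gemma3" model: the standard num_ctx request option asks for a larger context window, and the engine decides how the resulting KV cache fits onto the available hardware. The value 16384 is only an example.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "net/http"
    )

    func main() {
        // Ask for a larger context window; the engine lays out the KV cache
        // on the available hardware. num_ctx is a standard Ollama request
        // option; "gemma3" is assumed to be pulled locally.
        body, _ := json.Marshal(map[string]any{
            "model":  "gemma3",
            "prompt": "Summarize the document text extracted earlier.",
            "stream": false,
            "options": map[string]any{
                "num_ctx": 16384,
            },
        })

        resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Response string `json:"response"`
        }
        json.NewDecoder(resp.Body).Decode(&out)
        fmt.Println(out.Response)
    }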

Problems Solved

  1. Ollama v0.7 addresses the challenge of inconsistent multimodal support in local inference tools by providing a unified engine that respects model-specific architectures like sliding window attention and 2D rotary embeddings.
  2. The product targets developers and researchers who require local deployment of state-of-the-art multimodal models for applications in visual reasoning, document analysis, and cross-modal data processing.
  3. Typical use cases include analyzing video frames for geographic context (e.g., distance calculations between landmarks), identifying common elements across multiple images, and extracting text from vertically aligned Chinese documents.
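
As an illustration of the multi-image use case, the sketch below sends two images in a single chat request to a local Ollama server (POST /api/chat) and asks what they have in common. The model tag "qwen2.5vl" and the file names are assumptions; substitute any locally pulled vision model and real image paths.

    package main

    import (
        "bytes"
        "encoding/base64"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    type chatMessage struct {
        Role    string   `json:"role"`
        Content string   `json:"content"`
        Images  []string `json:"images,omitempty"` // base64-encoded
    }

    type chatRequest struct {
        Model    string        `json:"model"`
        Messages []chatMessage `json:"messages"`
        Stream   bool          `json:"stream"`
    }

    type chatResponse struct {
        Message chatMessage `json:"message"`
    }

    // encodeImage reads a file and returns its base64 encoding.
    func encodeImage(path string) string {
        raw, err := os.ReadFile(path)
        if err != nil {
            panic(err)
        }
        return base64.StdEncoding.EncodeToString(raw)
    }

    func main() {
        req := chatRequest{
            Model: "qwen2.5vl", // assumed model tag; any local vision model works
            Messages: []chatMessage{{
                Role:    "user",
                Content: "What do these two photos have in common?",
                Images:  []string{encodeImage("beach1.jpg"), encodeImage("beach2.jpg")},
            }},
            Stream: false,
        }

        body, _ := json.Marshal(req)
        resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out chatResponse
        json.NewDecoder(resp.Body).Decode(&out)
        fmt.Println(out.Message.Content)
    }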

Unique Advantages

  1. Unlike generic LLM runners, Ollama v0.7 implements each model's architecture as it was published, including Llama 4's chunked attention and Gemma 3's sliding window attention, reducing output degradation over long sequences.
  2. The engine integrates directly with the GGML tensor library through Go bindings, enabling custom inference graphs and hardware-specific optimizations validated by partnerships with NVIDIA, AMD, Intel, and Qualcomm.
  3. Competitive differentiation comes from first-class multimodal orchestration: each model's vision encoder, text decoder, and projection layer run as an isolated module, avoiding the shared-dependency conflicts common in llama.cpp-based systems.

Frequently Asked Questions (FAQ)

  1. Which vision models are currently supported in Ollama v0.7? Ollama v0.7 supports Meta Llama 4 Scout (109B MoE), Google Gemma 3, Qwen 2.5 VL, Mistral Small 3.1, and other vision models implemented on the new GGML-backed engine.
  2. How does Ollama handle multiple images in a single query? Users can pass multiple image paths in one prompt or across follow-up prompts; the engine applies positional encoding and cross-image attention as each model's architecture defines (see the sketch after this list).
  3. What memory optimizations are implemented for large images? Image embeddings are split into hardware-appropriate batches using model-defined boundaries, while cached intermediate representations reduce reprocessing overhead for subsequent queries involving the same images.
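
A sketch of the follow-up pattern behind questions 2 and 3: the chat API is stateless, so the client resends the whole conversation (including the original image) on each turn, and the engine's image cache is what lets it avoid re-encoding an image it has already processed. The model name and file path below are assumptions.

    package main

    import (
        "bytes"
        "encoding/base64"
        "encoding/json"
        "fmt"
        "net/http"
        "os"
    )

    // ask sends the full message history to /api/chat and returns the reply.
    func ask(messages []map[string]any) string {
        body, _ := json.Marshal(map[string]any{
            "model":    "gemma3", // assumed to be pulled locally
            "messages": messages,
            "stream":   false,
        })
        resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        var out struct {
            Message struct {
                Content string `json:"content"`
            } `json:"message"`
        }
        json.NewDecoder(resp.Body).Decode(&out)
        return out.Message.Content
    }

    func main() {
        raw, err := os.ReadFile("receipt.jpg") // illustrative path
        if err != nil {
            panic(err)
        }
        img := base64.StdEncoding.EncodeToString(raw)

        history := []map[string]any{{
            "role":    "user",
            "content": "What store is this receipt from?",
            "images":  []string{img},
        }}
        first := ask(history)
        fmt.Println(first)

        // Follow-up about the same image: append the assistant reply and the
        // new question, then resend. The image bytes are identical, so the
        // engine can reuse cached embeddings instead of reprocessing them.
        history = append(history,
            map[string]any{"role": "assistant", "content": first},
            map[string]any{"role": "user", "content": "What is the total amount?"},
        )
        fmt.Println(ask(history))
    }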
