
MiniCPM-V 4.5

GPT-4o level vision model on the phone

2025-08-26

Product Introduction

  1. MiniCPM-V 4.5 is an 8-billion-parameter open-source multimodal large language model (MLLM) designed for efficient image, video, and document understanding on local devices such as smartphones. It is built on the Qwen3-8B language model and the SigLIP2-400M vision encoder, and targets GPT-4o-level performance while being optimized for mobile deployment.
  2. The core value of MiniCPM-V 4.5 lies in its ability to deliver state-of-the-art multimodal capabilities with minimal computational overhead, enabling high-resolution visual processing, long-context video analysis, and complex document parsing directly on consumer hardware.

Main Features

  1. State-of-the-Art Vision-Language Performance: MiniCPM-V 4.5 achieves an average score of 77.2 on OpenCompass, outperforming GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B in vision-language tasks. It processes high-resolution images of up to 1.8 million pixels (e.g., 1344x1344) using the LLaVA-UHD architecture, with 4x fewer visual tokens than competing models.
  2. Efficient Video Understanding: A unified 3D-Resampler compresses 6 video frames (448x448) into 64 tokens, achieving a 96x compression rate and enabling high-refresh-rate (10 FPS) video analysis. More frames can be covered without increasing LLM inference cost, which benefits benchmarks such as Video-MME, LVBench, and MotionBench.
  3. Controllable Hybrid Fast/Deep Thinking: Users can toggle between fast thinking for low-latency responses and deep thinking for complex problem-solving, balancing efficiency and accuracy across scenarios like real-time OCR or detailed document parsing.
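
The 96x figure for the 3D-Resampler can be reproduced with simple arithmetic. The sketch below is only an illustration: the 14x14 patch size is an assumption (it is not stated in this listing), and `video_compression_rate` is a hypothetical helper, not part of any MiniCPM-V API.

```python
def video_compression_rate(frames: int, side: int, patch: int, out_tokens: int) -> float:
    """Ratio of raw visual patches to the tokens handed to the LLM."""
    patches_per_frame = (side // patch) ** 2   # 448 // 14 = 32, so 32 * 32 = 1024
    return frames * patches_per_frame / out_tokens

# 6 frames of 448x448 compressed into 64 tokens (figures from the listing);
# patch=14 is an assumed vision-encoder patch size.
rate = video_compression_rate(frames=6, side=448, patch=14, out_tokens=64)
print(rate)  # 96.0
```

Under that assumption, 6 x 1024 = 6144 patches are reduced to 64 tokens, which matches the quoted 96x rate.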

Problems Solved

  1. High Computational Costs for Multimodal Tasks: Addresses the inefficiency of processing high-resolution images and long videos by reducing token counts and enabling GPU-free inference through optimizations like llama.cpp and ollama support.
  2. Mobile Deployment Limitations: Targets developers and researchers needing desktop-level AI performance on smartphones, with quantized models (int4, GGUF, AWQ) and iOS app optimizations for iPhone/iPad.
  3. Specialized Use Case Gaps: Solves OCR, document parsing, and multilingual support challenges by outperforming GPT-4o-latest on OCRBench and achieving SOTA results on OmniDocBench for PDF analysis across 30+ languages.

Unique Advantages

  1. Superior Efficiency-to-Performance Ratio: At 8B parameters, it surpasses much larger models such as Qwen2.5-VL 72B in vision-language tasks while maintaining phone-compatible resource usage.
  2. Innovative Token Compression: The 3D-Resampler compresses video tokens at a 96x rate, and LLaVA-UHD cuts image tokens by 75% (4x fewer), enabling processing of inputs as large as 180 video frames or 1.8MP images.
  3. Commercial Accessibility: Free for academic use and available for commercial applications after registration, with enterprise-ready deployment options via SGLang, vLLM, and LLaMA-Factory fine-tuning.
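
To put the compression figures in perspective, the following sketch estimates the visual token budget for a long clip and converts an "N times fewer" ratio into a percentage reduction. It assumes frames are packed in groups of 6 with 64 tokens per group, as quoted under Main Features, and that a partial final group still costs a full 64 tokens. Both helper functions are hypothetical and written only for this illustration.

```python
import math

def video_tokens(num_frames: int, frames_per_group: int = 6, tokens_per_group: int = 64) -> int:
    """Visual tokens for a clip, assuming a partial last group is rounded up."""
    groups = math.ceil(num_frames / frames_per_group)
    return groups * tokens_per_group

def reduction_percent(ratio: float) -> float:
    """Convert an 'N times fewer tokens' ratio into a percentage reduction."""
    return (1 - 1 / ratio) * 100

print(video_tokens(180))                 # 1920 (30 groups of 64 tokens)
print(reduction_percent(4))              # 75.0
print(round(reduction_percent(96), 2))   # 98.96
```

A 4x ratio corresponds to a 75% reduction, while a 96x ratio corresponds to a reduction of roughly 99%.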

Frequently Asked Questions (FAQ)

  1. Can MiniCPM-V 4.5 be used commercially? Yes, after completing a registration questionnaire, the model weights are free for commercial use under the Apache-2.0 license, with enterprise deployment support via quantized formats and inference frameworks.
  2. How does it handle long videos? The 3D-Resampler compresses each group of up to 6 frames into 64 tokens, so inputs of up to 180 frames stay within a small token budget. This enables efficient analysis of high-FPS (10 FPS) and long-duration videos without GPU acceleration.
  3. What makes its OCR capabilities superior? MiniCPM-V 4.5 achieves 85.7% accuracy on OCRBench, outperforming GPT-4o-latest through RLAIF-V training and LLaVA-UHD’s 1344x1344 resolution support for dense text extraction.
  4. Is local phone deployment feasible? Yes, optimized iOS apps and llama.cpp/ollama integrations enable CPU-based inference on iPhones and iPads, and the quantized models (available in 16 sizes) reduce memory usage by up to 75%.
  5. Does it support multilingual inputs? The model processes 30+ languages via RLAIF-V training, with benchmarks showing improved accuracy over GPT-4o-latest in non-English document parsing and video understanding tasks.
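
The "up to 75%" memory figure is consistent with moving from 16-bit to 4-bit weights. The sketch below is a rough, weights-only estimate under that assumption. It ignores quantization metadata (scales, zero points), activations, and the KV cache, so real files and runtime footprints will differ. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(8e9, 16)    # assumed 16-bit baseline for an 8B model
int4 = weight_memory_gb(8e9, 4)     # 4-bit quantized weights
saving = (1 - int4 / fp16) * 100

print(fp16, int4, saving)  # 16.0 4.0 75.0
```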
