Product Introduction
- Apple's Foundation Models framework is a developer toolkit that provides direct access to a ~3-billion-parameter on-device language model optimized for Apple silicon, enabling privacy-focused AI integration in apps.
- The framework’s core value lies in its seamless integration with Swift, hardware-optimized efficiency, and cost-free inference, prioritizing user privacy while maintaining high performance for generative AI tasks like text summarization and entity extraction.
Main Features
- The framework supports guided generation through Swift’s `@Generable` macro, allowing developers to define structured output formats that the model adheres to via OS-level constrained decoding and speculative decoding (a sketch follows this list).
- It enables tool calling via a Swift protocol, letting developers expose custom tools the model can invoke to call services or retrieve data, with parallel and serial tool execution handled automatically (also sketched after this list).
- Developers can train rank-32 adapters with a Python toolkit to specialize the base model for niche tasks; adapter weights are tied to a specific base-model version, so they must be retrained when the base model is updated.
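
Below is a minimal sketch of guided generation, based on the publicly documented `@Generable`/`@Guide` macros and `LanguageModelSession.respond(to:generating:)`; the `EventDetails` type, its fields, and the prompt are illustrative assumptions, not part of the framework.

```swift
import FoundationModels

// Hypothetical structured output type. The @Generable and @Guide macros tell
// the framework what schema the model's response must conform to.
@Generable
struct EventDetails {
    @Guide(description: "The event's title")
    var title: String

    @Guide(description: "The venue or address where the event takes place")
    var location: String
}

// The OS constrains decoding so the response matches the EventDetails schema;
// no manual JSON parsing or validation is needed.
func extractEvent(from flyerText: String) async throws -> EventDetails {
    let session = LanguageModelSession()
    let response = try await session.respond(
        to: "Extract the event details from this flyer text: \(flyerText)",
        generating: EventDetails.self
    )
    return response.content
}
```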
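And a sketch of tool calling, following the pattern in Apple’s sample code; `WeatherTool`, its hard-coded data, and the plain `String` return type are assumptions (earlier SDK betas used a dedicated `ToolOutput` type, so the exact `call(arguments:)` signature may differ).

```swift
import FoundationModels

// Hypothetical tool. The Tool protocol supplies a name, a description, and a
// @Generable Arguments type that the model fills in when it decides to call it.
struct WeatherTool: Tool {
    let name = "getWeather"
    let description = "Retrieve the current temperature for a city."

    @Generable
    struct Arguments {
        @Guide(description: "The city to look up")
        var city: String
    }

    func call(arguments: Arguments) async throws -> String {
        // Placeholder data source; a real app would query a weather service here.
        let temperature = 21
        return "It is \(temperature)°C in \(arguments.city)."
    }
}

// Register the tool with a session; the framework routes the model's tool calls
// to call(arguments:) and feeds the output back into the conversation.
func answerWeatherQuestion(_ question: String) async throws -> String {
    let session = LanguageModelSession(
        tools: [WeatherTool()],
        instructions: "Answer questions about the weather."
    )
    let response = try await session.respond(to: question)
    return response.content
}
```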
Problems Solved
- Addresses privacy concerns by eliminating cloud dependency for AI inference, ensuring sensitive data remains on-device and compliant with Apple’s strict privacy standards.
- Targets iOS/macOS developers seeking to integrate AI features like text refinement or image understanding without managing server infrastructure or incurring API costs.
- Supports use cases such as localized content generation, in-app document summarization, and visual data parsing (e.g., extracting event details from flyers) while maintaining low latency (a minimal summarization sketch follows this list).
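
A minimal sketch of the no-server integration path, assuming the documented `SystemLanguageModel.default.availability` check and a plain-text `LanguageModelSession`; the `summarize(_:)` helper and its instructions string are hypothetical.

```swift
import FoundationModels

// Summarizes a document entirely on-device; returns nil when the model is not
// available (unsupported hardware, Apple Intelligence disabled, or model
// assets still downloading).
func summarize(_ document: String) async throws -> String? {
    guard case .available = SystemLanguageModel.default.availability else {
        return nil
    }
    let session = LanguageModelSession(
        instructions: "Summarize the user's text in three sentences or fewer."
    )
    let response = try await session.respond(to: document)
    return response.content
}
```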
Unique Advantages
- Unlike cloud-reliant alternatives, the framework runs on Apple silicon’s Neural Engine, combining sub-100ms latency with 2-bit quantization and KV-cache sharing that cuts KV-cache memory usage by 37.5% (a back-of-envelope check follows this list).
- Introduces PT-MoE architecture for server models, combining parallel transformer tracks with mixture-of-experts layers to cut synchronization overhead by 87.5% while scaling to 14T training tokens.
- Outperforms comparable 3B-4B-parameter models (e.g., Qwen-2.5-3B, Gemma-3-4B) in human evaluations of multilingual and image tasks, with a 33.5% win rate for English text responses and a 46.6% win rate against InternVL-2.5-4B in image understanding.
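
A back-of-envelope check of the 37.5% figure, under the assumption (from Apple’s published model description) that the shared-cache block spans three-eighths of the model’s depth, i.e., a 5:3 split between the two blocks, so only the first block keeps its own key-value cache:

```latex
\frac{\mathrm{KV}_{\text{with sharing}}}{\mathrm{KV}_{\text{baseline}}}
  = 1 - \frac{3}{8} = \frac{5}{8} = 0.625
  \quad\Longrightarrow\quad \text{a } 37.5\%\ \text{reduction in KV-cache memory}
```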
Frequently Asked Questions (FAQ)
- How does guided generation ensure output format compliance? The Swift compiler translates `@Generable`-annotated types into schema specifications that are injected into prompts, while post-training on format-aligned datasets enables the model to natively generate structured outputs validated by OS daemons.
- Can the on-device model process non-English languages? Yes, the framework supports 15 languages via a 150K-token vocabulary and locale-specific evaluations, achieving a 30.2% win rate against Qwen-2.5-3B in PFIGSCJK (Portuguese, French, Italian, German, Spanish, Chinese, Japanese, Korean) locales.
- How are server models optimized for efficiency? The PT-MoE architecture uses block-level parallelism together with ASTC texture compression (3.56 bits/weight) and hardware-accelerated decoding, holding quality regression to 2.7% on the MGSM benchmark while keeping a memory footprint roughly 50% smaller than Llama-4-Scout’s.
