
Kuzco

Open-source Swift package to run LLMs locally on iOS & macOS

2025-08-17

Product Introduction

  1. Kuzco is a Swift package that enables integration of large language models (LLMs) directly into iOS, macOS, and Mac Catalyst applications using on-device inference powered by llama.cpp. It provides developers with tools to run AI models locally without relying on cloud services or external APIs.
  2. The core value of Kuzco is privacy-focused, high-performance AI optimized for Apple platforms: data stays on-device during inference, and the API embraces modern Swift concurrency patterns such as async/await.

Main Features

  1. Kuzco supports multiple LLM architectures (LLaMA, Mistral, Phi, Gemma, Qwen) with automatic detection from filenames, enabling seamless deployment of diverse models without manual configuration.
  2. It offers async/await-friendly APIs for streaming responses, enabling real-time interaction with models while respecting the memory constraints typical of mobile and desktop applications (a brief usage sketch follows this list).
  3. The package includes advanced resource management features such as instance caching, GPU layer offloading for Apple Silicon, and configurable CPU thread allocation to optimize performance across devices.
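
For a sense of how these pieces might fit together, here is a minimal streaming sketch. Only ModelProfile, InstanceSettings, gpuOffloadLayers, and cpuThreadCount are named elsewhere on this page; the Kuzco.shared entry point, the predict call, and the initializer labels are assumptions about the API shape rather than confirmed signatures.

```swift
import Kuzco

// Hedged sketch: identifiers other than ModelProfile, InstanceSettings,
// gpuOffloadLayers, and cpuThreadCount are assumed, not confirmed API.
func streamAnswer() async throws {
    // Point the profile at a GGUF model already stored on the device.
    let profile = ModelProfile(sourcePath: "/path/to/mistral-7b-instruct-q4_0.gguf")

    // Resource limits for the current device (parameter names from the FAQ below).
    var settings = InstanceSettings()
    settings.gpuOffloadLayers = 35   // offload layers to Metal on Apple Silicon
    settings.cpuThreadCount = 6

    // Assumed entry point that loads (or reuses a cached) instance and streams tokens.
    let stream = try await Kuzco.shared.predict(
        with: profile,
        settings: settings,
        prompt: "Summarize today's meeting notes in one sentence."
    )

    // Tokens arrive incrementally, so the UI can render them as they stream in.
    for try await token in stream {
        print(token, terminator: "")
    }
}
```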

Problems Solved

  1. Kuzco addresses privacy concerns by eliminating cloud dependencies, ensuring sensitive data never leaves the device during LLM inference.
  2. It targets iOS/macOS developers requiring on-device AI capabilities for applications like chatbots, content generation tools, or localized data analysis without internet connectivity.
  3. Typical use cases include implementing offline-first AI assistants, enabling secure enterprise document processing, and creating personalized user experiences with customizable prompt templates.

Unique Advantages

  1. Unlike generic LLM wrappers, Kuzco provides Swift-native APIs specifically optimized for Apple platforms, including Metal acceleration and memory-efficient context management tailored for mobile devices.
  2. Its architecture fallback system automatically selects compatible model variants when an exact match is unavailable, reducing deployment friction for less common GGUF files (see the sketch after this list).
  3. Competitive advantages include built-in support for 15+ model architectures, detailed error handling with recovery suggestions, and preconfigured quantization profiles for balancing performance and model quality.
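
The fallback path described above might look roughly like the following. The page confirms only the ModelProfile.createWithFallback() method name; the parameter labels and the architecture value are illustrative assumptions.

```swift
import Kuzco

// Sketch only: createWithFallback() is named on this page, but the parameter
// labels and the architecture enum case used here are assumptions.
let profile = ModelProfile.createWithFallback(
    sourcePath: "/models/my-custom-finetune.gguf", // filename carries no architecture keyword
    architecture: .mistral                         // explicit hint; a compatible variant is used if needed
)
```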

Frequently Asked Questions (FAQ)

  1. How do I resolve "unknown model architecture" errors? Ensure your GGUF filename contains recognizable architecture keywords (e.g., "mistral" or "phi-3"), or explicitly specify the architecture using ModelProfile.createWithFallback() to enable compatibility checks and automatic fallbacks.
  2. Why does inference speed vary across devices? Performance depends on hardware capabilities; adjust InstanceSettings parameters such as gpuOffloadLayers (35+ for M-series chips) and cpuThreadCount (4-8 for modern processors) to match your target device's specifications (see the configuration sketch after this list).
  3. What quantization formats are recommended for iOS? Use Q4_0 or Q4_1 quantized models to balance quality and memory usage, as they typically require 4-6GB RAM while maintaining acceptable response accuracy for most applications.
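
As a rough illustration of the FAQ guidance on device tuning, the sketch below adjusts the two InstanceSettings parameters mentioned above. The no-argument initializer and the platform split are assumptions; only the property names come from this page.

```swift
import Kuzco

// Illustrative tuning per the FAQ guidance; only gpuOffloadLayers and
// cpuThreadCount are named on this page, and the empty initializer is assumed.
var settings = InstanceSettings()

#if os(macOS) || targetEnvironment(macCatalyst)
settings.gpuOffloadLayers = 40   // M-series Macs: offload most layers to Metal
settings.cpuThreadCount = 8      // plenty of performance cores available
#else
settings.gpuOffloadLayers = 35   // recent iPhones and iPads with Apple GPUs
settings.cpuThreadCount = 4      // conservative default to limit thermals and memory pressure
#endif
```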
