Product Introduction
- Mu is a 330-million-parameter on-device small language model (SLM) developed by Microsoft, optimized to run locally on Neural Processing Units (NPUs) in Copilot+ PCs. It enables real-time AI interactions by processing natural language inputs and mapping them to specific system functions, such as adjusting Windows Settings.
- The core value of Mu lies in its ability to deliver high-speed, low-latency AI inference entirely offline, ensuring privacy and reducing dependency on cloud resources while maintaining performance comparable to larger models.
Main Features
- Mu employs an encoder-decoder transformer architecture that encodes the input once and reuses it throughout decoding, yielding roughly 47% lower first-token latency and about 4.7× higher decoding speed than a decoder-only model of similar size (see the first sketch after this list).
- The model is fully offloaded to NPUs, with hardware-specific operator optimizations for Qualcomm Hexagon, Intel, and AMD NPUs, achieving over 100 tokens per second of throughput and sub-500 ms response times in real-world scenarios such as the Windows Settings agent.
- Post-training quantization to 8-bit and 16-bit integer weight representations reduces memory usage while preserving accuracy, enabling efficient deployment on resource-constrained devices such as the Surface Laptop 7 (a hedged sketch of loading a quantized model on the NPU follows this list).
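As a rough illustration of why the encoder-decoder split helps, the sketch below encodes the prompt once and reuses the cached encoder states across every decoding step. It assumes a generic Hugging Face seq2seq API and a placeholder checkpoint as a stand-in; Mu itself is not distributed this way.

```python
# Minimal sketch: the encoder runs once and its output is reused at every
# decoding step, which is the main latency win over decoder-only models.
# "google/flan-t5-small" is a placeholder encoder-decoder checkpoint.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "google/flan-t5-small"  # stand-in model, not Mu

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID).eval()

query = "turn on night light"
inputs = tokenizer(query, return_tensors="pt")

with torch.no_grad():
    # Encode the full input once.
    encoder_outputs = model.get_encoder()(**inputs)
    # The decoder attends to the fixed encoder states at each step;
    # generate() reuses them instead of re-processing the prompt.
    output_ids = model.generate(
        encoder_outputs=encoder_outputs,
        attention_mask=inputs["attention_mask"],
        max_new_tokens=32,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```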
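The NPU offload and quantization points can be pictured as loading an int8-quantized ONNX export through ONNX Runtime's Qualcomm QNN execution provider, with a CPU fallback. The file name, input name, and the idea that Mu ships as a plain ONNX file are assumptions for illustration only.

```python
# Hedged sketch: run a quantized export on the NPU via ONNX Runtime's
# QNN execution provider, falling back to CPU if it cannot load.
# "mu_settings_agent_int8.onnx" is a hypothetical file name.
import numpy as np
import onnxruntime as ort

providers = [
    # Qualcomm Hexagon NPU backend (assumed available on a Copilot+ PC).
    ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"}),
    "CPUExecutionProvider",  # fallback when the NPU provider is unavailable
]

session = ort.InferenceSession("mu_settings_agent_int8.onnx", providers=providers)

# Dummy token IDs standing in for a tokenized Settings query.
input_ids = np.array([[101, 2054, 2003, 102]], dtype=np.int64)
outputs = session.run(None, {"input_ids": input_ids})
print(outputs[0].shape)
```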
Problems Solved
- Mu addresses the challenge of navigating complex Windows system settings by translating natural language queries into precise function calls, eliminating the need for manual menu exploration (see the mapping sketch after this list).
- It targets users of Copilot+ PCs who require instant, privacy-preserving AI assistance for local tasks without relying on cloud connectivity or sacrificing performance.
- Typical use cases include adjusting display brightness, managing network configurations, and resolving ambiguous queries such as "increase brightness" on multi-monitor setups by prioritizing the most commonly used settings.
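To make the query-to-function mapping concrete, here is a hypothetical sketch of the post-processing step: the model's textual output is parsed into a function name plus arguments, and an ambiguous target is resolved by preferring the most frequently used setting. The function names, output format, and usage counts are invented for illustration and are not Mu's actual interface.

```python
# Hypothetical sketch: map a model output like
#   "set_brightness(target=display, delta=+20)"
# to a callable Settings action, resolving ambiguity (e.g. several
# monitors) by preferring the most frequently used device.
import re

# Invented usage statistics standing in for "frequently used settings".
USAGE_COUNTS = {"built-in display": 120, "external monitor": 15}

def parse_call(model_output: str) -> tuple[str, dict[str, str]]:
    """Parse 'name(k=v, k=v)' into a function name and argument dict."""
    match = re.fullmatch(r"(\w+)\((.*)\)", model_output.strip())
    if not match:
        raise ValueError(f"unparseable model output: {model_output!r}")
    name, arg_str = match.groups()
    args = dict(
        part.split("=", 1) for part in arg_str.split(",") if "=" in part
    )
    return name, {k.strip(): v.strip() for k, v in args.items()}

def resolve_display(target: str) -> str:
    """If the target is ambiguous, pick the most commonly used display."""
    if target != "display":
        return target
    return max(USAGE_COUNTS, key=USAGE_COUNTS.get)

name, args = parse_call("set_brightness(target=display, delta=+20)")
print(name, resolve_display(args["target"]), args["delta"])
# -> set_brightness built-in display +20
```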
Unique Advantages
- Unlike cloud-dependent models, Mu operates entirely on-device, ensuring data privacy and eliminating network round-trip latency, and after task-specific fine-tuning it performs comparably to the much larger Phi-3.5-mini despite being roughly one-tenth its size.
- Innovations such as dual LayerNorm (pre- and post-normalization), Rotary Positional Embeddings (RoPE) for long-context extrapolation, and Grouped-Query Attention (GQA) enable stable training and efficient long-context processing at this scale (an illustrative RoPE/GQA sketch follows this list).
- Competitive advantages include a 2/3–1/3 encoder-decoder parameter split tuned for NPU efficiency, weight sharing between the input and output embeddings, and synthetic-data scaling that grew the Settings-agent fine-tuning set to 3.6M samples (a configuration-style sketch of the split and embedding tying also follows this list).
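The rotary-embedding and grouped-query-attention ideas from the list above can be sketched in a few lines. The head counts and dimensions below are illustrative, not Mu's actual configuration.

```python
# Illustrative sketch of RoPE and grouped-query attention (GQA).
# Dimensions and head counts are invented, not Mu's real configuration.
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs by a position-dependent angle (RoPE)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# GQA: many query heads share a smaller set of key/value heads.
n_q_heads, n_kv_heads, head_dim, seq = 8, 2, 32, 16
q = apply_rope(torch.randn(n_q_heads, seq, head_dim))
k = apply_rope(torch.randn(n_kv_heads, seq, head_dim))
v = torch.randn(n_kv_heads, seq, head_dim)

group = n_q_heads // n_kv_heads        # 4 query heads per KV head
k = k.repeat_interleave(group, dim=0)  # expand KV heads to match queries
v = v.repeat_interleave(group, dim=0)
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
out = attn @ v
print(out.shape)  # torch.Size([8, 16, 32])
```

Sharing key/value heads across groups of query heads shrinks the KV cache, which is one reason GQA suits memory-constrained NPUs.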
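A hedged sketch of what a 2/3–1/3 encoder-decoder split with tied input/output embeddings might look like as a configuration; the layer counts and dimensions are assumptions, not Mu's published hyperparameters.

```python
# Hedged sketch: an encoder-heavy layer split plus tied input/output
# embeddings. Layer counts and dimensions are illustrative only.
import torch.nn as nn

TOTAL_LAYERS = 24
enc_layers = TOTAL_LAYERS * 2 // 3       # 16 encoder layers (~2/3)
dec_layers = TOTAL_LAYERS - enc_layers   # 8 decoder layers (~1/3)

vocab_size, d_model = 32000, 640

embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying: the output projection reuses the input embedding matrix,
# so the large vocabulary table is stored only once.
lm_head.weight = embedding.weight

print(enc_layers, dec_layers, lm_head.weight is embedding.weight)
```

Storing the vocabulary matrix once matters under tight NPU memory budgets, which is the motivation the list above points to.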
Frequently Asked Questions (FAQ)
- How does Mu differ from cloud-based AI models? Mu processes all data locally on NPUs, ensuring no user data leaves the device, which enhances privacy and reduces latency compared to cloud-dependent solutions.
- Can Mu handle ambiguous or incomplete user queries? The agent is integrated with Windows Settings search: very short queries fall back to lexical and semantic search, while Mu handles multi-word queries, balancing accuracy and responsiveness (see the routing sketch after this FAQ).
- What hardware is required to run Mu? Mu requires Copilot+ PCs with NPUs from Qualcomm, Intel, or AMD; the shipped experiences, such as the Settings agent, rely on hardware-specific NPU optimizations and are not offered on CPU- or GPU-only machines.
- How was Mu optimized for NPU performance? Collaboration with silicon partners enabled operator-level tuning, alignment with NPU memory constraints, and quantization validation, achieving 200+ tokens/second on a Surface Laptop 7.
- Does Mu support third-party app integrations? Currently, Mu is task-specific to Windows system functions, but its architecture allows future expansion to additional use cases via LoRA fine-tuning (a hedged LoRA sketch follows this FAQ).
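The routing behavior described in the second FAQ entry can be pictured as a small dispatcher: very short queries go to lexical/semantic Settings search, longer ones to the Mu-based agent. The threshold and handler functions here are hypothetical.

```python
# Hypothetical dispatcher: short queries fall back to Settings search,
# multi-word queries are routed to the Mu-based agent. The threshold and
# the two handler functions are invented for illustration.
MIN_WORDS_FOR_AGENT = 3  # assumed cutoff, not a documented value

def settings_search(query: str) -> str:
    return f"[search] results for {query!r}"       # lexical/semantic fallback

def mu_agent(query: str) -> str:
    return f"[agent] function call for {query!r}"  # SLM-backed action

def route(query: str) -> str:
    if len(query.split()) < MIN_WORDS_FOR_AGENT:
        return settings_search(query)
    return mu_agent(query)

print(route("bluetooth"))                 # short -> search fallback
print(route("turn off my wifi adapter"))  # multi-word -> Mu agent
```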
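As a hedged illustration of the LoRA-based expansion path mentioned in the last FAQ entry, the sketch below attaches low-rank adapters to a generic seq2seq model with the `peft` library. The base checkpoint, target modules, and hyperparameters are placeholders, since Mu's fine-tuning setup is not public.

```python
# Hedged sketch: LoRA adapters on a stand-in encoder-decoder model via peft.
# The checkpoint, target modules, and rank are placeholders, not Mu's setup.
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

lora_cfg = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                       # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "v"],  # attention projections in T5-style blocks
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable
```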