Product Introduction
- Definition: LocalClicky is an open-source, MIT-licensed, offline voice assistant and computer control application designed specifically as a macOS menubar utility. It functions as a privacy-first, local AI-powered interface for voice-to-text transcription, natural language processing, and direct system control without requiring cloud connectivity.
- Core Value Proposition: LocalClicky exists to solve the fundamental privacy trade-off of modern voice assistants. It enables a user to have a "real conversation" with their computer—issuing chained commands, controlling the UI with voice, and interacting with on-screen elements—while ensuring zero data leaves the local machine. The core proposition is complete local execution for voice, AI reasoning, vision, and control.
Main Features
- Offline Voice Pipeline & Session Control: The application uses a wake word ("Computer") to initiate a session managed by a state machine. It employs Whisper.cpp (running locally via the
whisper-clisubprocess) for high-accuracy offline speech-to-text transcription. Voice Activity Detection (VAD), powered by thewebrtcvad-wheelslibrary, automatically stops recording upon detecting silence, eliminating fixed timeouts. A session remains active for multi-turn conversation, chaining commands back-to-back until the user says a dismissal phrase like "goodbye," creating a seamless, conversational flow without repeated wake word invocations. - Local AI Reasoning & Tool Calling: At its core, LocalClicky integrates Ollama to run local large language models. The primary command model (default:
qwen3:8b) handles user intent, reasoning, and complex tool-calling workflows. It can invoke multiple tools in sequence—like executing shell commands (run_shell_command), querying system information (query_system), or analyzing the screen—within a single request, supporting up to 5 rounds of tool calls with streaming output. Conversation memory is maintained for the last 10 exchanges within a session. - Vision-Based Screen Interaction: LocalClicky features an on-demand vision system. When a command requires visual context (e.g., "click the notification bell"), the assistant automatically triggers the
look_at_screentool. This tool captures a screenshot using macOS nativescreencapture, resizes it for efficiency, and sends it to a multimodal vision model (default:gemma4:e4brunning locally via Ollama). The model analyzes the visual input and returns precise bounding box coordinates, which the application'sCursorControlmodule then uses to calculate and execute a click or hover action via PyAutoGUI. - Comprehensive macOS Control & Integration: Beyond vision, LocalClicky provides extensive direct system control. It can open/quit applications, adjust system and app-specific volume (e.g., Spotify via AppleScript), manage files via shell commands, create calendar reminders using natural language dates, control Chrome via JavaScript injection, and perform other macOS-specific tasks. The response is delivered through the built-in macOS
saycommand, ensuring text-to-speech operates entirely offline.
Problems Solved
- Pain Point: The primary problem is the privacy and security risk inherent in cloud-based voice assistants, where user audio, screen data, and commands are uploaded to external servers. LocalClicky eliminates this by performing all processing—transcription, AI inference, and vision analysis—directly on the user's hardware. It also solves the inconvenience of fixed-timeout recording and the limitation of single-command voice assistants by offering a persistent, chainable session mode.
- Target Audience: This tool is essential for privacy-conscious macOS power users, developers, creative professionals, and security-aware individuals who want efficient voice control without compromising data confidentiality. It caters to users who manage sensitive data on their machines and seek to automate repetitive UI interactions or system tasks hands-free.
- Use Cases: Key use cases include: Developer workflow automation (e.g., "Open a new terminal tab and run
npm test"); Accessibility enhancement (using voice to navigate UIs and perform clicks); Productivity for knowledge workers (managing browser tabs, applications, and reminders without touching the mouse/keyboard); and Privacy-sensitive environment operations (using AI assistance in environments where network data transmission is prohibited or undesirable).
Unique Advantages
- Differentiation: Unlike mainstream cloud assistants (Siri, Alexa) or online AI tools, LocalClicky's architecture guarantees complete data locality. There are no API keys, no subscriptions, and no telemetry. Compared to other local voice tools, its key differentiator is the integrated, multi-model pipeline that combines offline wake word detection (via Google Speech Recognition), local transcription (Whisper), local LLM reasoning with tool use (Ollama), and local vision analysis (Ollama multimodal) into a single, cohesive menubar application. It focuses specifically on deep, actionable computer control rather than just querying an LLM.
- Key Innovation: The key innovation is the tightly integrated, all-local, multi-modal tool-calling pipeline. The assistant doesn't just transcribe and respond; it perceives the screen, reasons about it, and acts upon it through native macOS subsystems. The architecture allows the vision model (
gemma4) to be called autonomously by the command model (qwen3) when visual data is needed, creating a self-determining agent that interacts with the graphical user interface in real-time, all while staying offline.
Frequently Asked Questions (FAQ)
- How does LocalClicky maintain privacy, and what data leaves my computer? LocalClicky is designed for complete offline operation. All voice transcription, AI processing, and vision analysis occur locally on your Mac using Whisper.cpp and Ollama. The only exception is the initial wake word detection ("Computer"), which uses Google's Speech Recognition API and requires a temporary internet connection for that specific function. No audio, screenshots, or commands are permanently stored or transmitted after processing.
- What are the system requirements to run LocalClicky? You need a macOS 12+ machine with Python 3.11+ and Homebrew installed. Approximately 8GB of free RAM is recommended to run both the command (
qwen3:8b) and vision (gemma4:e4b) Ollama models simultaneously. You must also grant Microphone, Screen Recording, and Accessibility permissions to the Python environment. An internet connection is required only for the wake word detection component. - Can I use different AI models with LocalClicky, or is it limited to Qwen and Gemma? While optimized for
qwen3:8b(command) andgemma4:e4b(vision), the system is configurable. You can swap models in the code, but the command model must support reliable tool calling, and the vision model must be multimodal (image-capable). Alternative tested combinations includeqwen3:14bfor better reasoning orqwen2.5vl:7bas an alternative vision model. Changing models affects performance and hardware requirements. - How does LocalClicky handle clicking on specific items on my screen? When a command implies visual interaction (e.g., "click the save button"), the assistant internally calls the
look_at_screenfunction. It takes a screenshot, sends it to the local vision model (gemma4), and receives bounding box coordinates for the target element. TheCursorControlmodule then calculates the center of that box and uses PyAutoGUI to execute a mouse click, effectively allowing you to interact with graphical elements using natural voice commands.
