An always-on visual AI companion that lives directly in your macOS menu bar, guiding you in real-time.
Echo is a premium, always-on visual AI companion for macOS. Driven by a simple global keyboard shortcut, Echo looks at your screen, listens to your voice instructions, and translates your intentions into physical operating system interactions, such as soaring across monitors to point out specific options, opening local directories, or programmatically closing Finder windows.
This project is inspired by the original open-source learning-buddy assistant created by Farza (MIT License).
While preserving the magical essence of an on-screen visual companion that physically flies along physics-based arcing curves, Aditya Gupta has extensively re-engineered the application, introducing a powerful Agentic click-sync layer and deep OS automation. We are still actively developing and constantly improving this project to reach its full potential.
The Core Aim: To create an interactive AI companion that helps beginners, self-learners, and non-trained individuals directly master any software in real-time, right inside the application they are trying to learn.
Traditionally, learning new software means constantly changing tabs, pausing a YouTube tutorial, switching back to the app, making a mistake, and repeating this frustrating cycle. Echo destroys this friction. It acts as a visual, interactive tutor directly on top of your workspace, guiding your eyes, highlighting buttons, and opening resources without you ever leaving your active window.
- Parabolic Bezier Flight: The companion (a glowing blue triangle) does not simply teleport. It flies along physics-based quadratic curves, rotating dynamically to align with its trajectory, scaling up at the apex, and landing exactly on target.
- Push-To-Talk Listening: Hold down `Control + Option` to transform the companion into an active, glowing blue voice waveform that listens to your queries.
- Offline Local Transcription: Uses macOS's native, offline Apple Speech framework to transcribe your voice locally on-device. No audio files are sent over the network, so raw voice recordings never leave your Mac.
- Bilingual & Dialect Support: Natively supports dictation and voice interactions in Hindi, Hinglish (Hindi written in Roman script), and English. The local speech recognizer is fully optimized for Indian accents, and the AI model is instructed to automatically match your spoken language when talking back.
- Circular Magnifying Lens Highlight: When pointing at folders or items, Echo overlays a premium high-tech circular magnifying lens directly over the target, rendering a crop of the raw screen buffer scaled by 1.3x with a gloss glare metal rim so the user instantly spots their item.
- Agentic OS Integration: Dynamically parses voice instructions into structured local actions, like opening paths, closing windows, maximizing windows to fill the screen, or tiling/shifting windows to the left or right half of the screen.
- Starvation-Proof Timers: Core tracking and animations are scheduled on `RunLoop.main` in `.common` mode, preventing the cursor from freezing even during heavy UI interaction (like window dragging).
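The timer behaviour described in the last bullet can be sketched with plain Foundation calls. This is a minimal illustration, not Echo's actual code: the helper name `scheduleCommonModeTimer` is hypothetical, and the only point it makes is that a timer added in `.common` mode keeps firing while the run loop is in event-tracking mode.

```swift
import Foundation

/// Hypothetical helper: schedules a repeating timer on the main run loop in
/// `.common` mode, so it also fires while the run loop is tracking events
/// (for example during a window drag).
@discardableResult
func scheduleCommonModeTimer(interval: TimeInterval,
                             onTick: @escaping () -> Void) -> Timer {
    // Timer(timeInterval:repeats:block:) creates an unscheduled timer.
    // Timer.scheduledTimer(...) would register it in `.default` mode only.
    let timer = Timer(timeInterval: interval, repeats: true) { _ in onTick() }
    RunLoop.main.add(timer, forMode: .common)
    return timer
}
```

A caller would keep the returned `Timer` and call `invalidate()` on it when tracking stops.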
Echo's operation is structured into 6 highly optimized stages that run sequentially on the main thread and background tasks:
sequenceDiagram
autonumber
actor User
participant Keyboard as globalShortcutMonitor
participant Speech as AppleSpeechEngine
participant Capturer as ScreenCaptureUtility
participant AI as GroqAPI (Llama-4 Scout)
participant OS as macOS (AX & Finder APIs)
participant Bubble as SwiftUI OverlayWindow
User->>Keyboard: Hold [Control + Option] & Speak
Keyboard->>Speech: Stream mic audio buffer
Speech-->>User: Transcribe speech to text offline (Real-time)
User->>Keyboard: Release [Control + Option]
Keyboard->>Capturer: Request high-res screenshot (2048px max)
Capturer-->>Keyboard: Return JPEG image data
Keyboard->>AI: Send image + transcription prompt
AI-->>Keyboard: Return response with [POINT:x,y:label] and [RUN:...]
Keyboard->>OS: Run Priority Snapping (Accessibility -> Finder -> Vision)
OS-->>Keyboard: Return perfect, resolved screen point
Keyboard->>Bubble: Trigger Parabolic Bezier flight animation
Bubble-->>User: Snap precisely onto target & show spoken text bubble
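The flight animation in the final steps can be illustrated with the standard quadratic Bezier formulas. The sketch below is an assumption-laden example rather than the code in `OverlayWindow.swift`: the types `Point2D` and `FlightFrame`, the function `flightFrame`, the midpoint-based control point, and the sine-shaped scale curve are all hypothetical choices.

```swift
import Foundation

struct Point2D { var x: Double; var y: Double }

struct FlightFrame {
    let position: Point2D
    let rotation: Double   // radians, direction of travel
    let scale: Double      // 1.0 at the endpoints, peakScale at mid-flight
}

/// Evaluates one frame of a quadratic Bezier flight.
/// Assumption: the control point is the midpoint of start and end, shifted
/// vertically by `arcOffset` (the sign depends on the coordinate system).
func flightFrame(from start: Point2D, to end: Point2D,
                 arcOffset: Double, peakScale: Double,
                 progress: Double) -> FlightFrame {
    let t = min(max(progress, 0), 1)
    let u = 1 - t
    let control = Point2D(x: (start.x + end.x) / 2,
                          y: (start.y + end.y) / 2 + arcOffset)

    // B(t) = u^2 * P0 + 2 * u * t * P1 + t^2 * P2
    let position = Point2D(
        x: u * u * start.x + 2 * u * t * control.x + t * t * end.x,
        y: u * u * start.y + 2 * u * t * control.y + t * t * end.y)

    // B'(t) = 2 * u * (P1 - P0) + 2 * t * (P2 - P1); its angle is the heading.
    let dx = 2 * u * (control.x - start.x) + 2 * t * (end.x - control.x)
    let dy = 2 * u * (control.y - start.y) + 2 * t * (end.y - control.y)

    // Scale rises smoothly to peakScale at t = 0.5 and returns to 1.0.
    let scale = 1 + (peakScale - 1) * sin(Double.pi * t)

    return FlightFrame(position: position, rotation: atan2(dy, dx), scale: scale)
}
```

Calling `flightFrame` once per timer tick with `progress` running from 0 to 1 yields the position, heading, and scale for each frame; at `progress == 1` the position equals the target exactly.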
| Layer | Technologies Used |
|---|---|
| Core Frameworks | SwiftUI, AppKit, Foundation, Combine |
| System Automation | CoreGraphics, Accessibility API (AXUIElement), NSAppleScript |
| Screen Capture | ScreenCaptureKit (Optimized display filtering) |
| Voice Processing | AVFoundation (Microphone recording), Speech (Offline transcription) |
| AI LLM API | Groq Vision API (Low latency endpoint) |
| Vision Model | meta-llama/llama-4-scout-17b-16e-instruct |
| Text-To-Speech | ElevenLabs TTS (with Apple System speech fallback) |
To make Echo a truly production-grade tool with pixel-perfect accuracy, we built several core custom systems on top of the original prototype:
- Problem: Original regex parsers were rigid and expected exact integer coordinates with no spaces. Standard Llama models frequently output float coordinates (like `0.074` for percentages) or add standard spacing (`[POINT: 810, 0.074 : microsoft]`), causing the previous parser to fail and read raw coordinates out loud.
- Solution: Rewrote the parser using a case-insensitive, whitespace-tolerant pattern (`#"\[\s*POINT\s*:\s*(?:none|(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)...)\s*\]"#`). All tags are stripped cleanly from spoken text, and normalized decimal coordinates (0.0 to 1.0) are automatically detected and scaled to matching pixel coordinates based on the screen size.
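A tolerant tag parser along these lines could look like the sketch below. It is not the project's parser: the pattern here is a self-contained stand-in (the label group is an assumption, since the original pattern is elided), and `PointTag`, `scaled`, and `parsePointTag` are hypothetical names. The normalization rule, treating a value with a decimal point that is at most 1.0 as a fraction of the screen, is likewise an assumed heuristic.

```swift
import Foundation

struct PointTag: Equatable {
    let x: Double
    let y: Double
    let label: String?
}

// Stand-in pattern: tolerates whitespace, floats, and an optional ":label".
let pointTagPattern =
    #"\[\s*POINT\s*:\s*(?:none|(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)(?:\s*:\s*([^\]]*?))?)\s*\]"#

/// Assumed heuristic: "0.074" is a fraction of the extent, "810" is pixels.
func scaled(_ raw: String, extent: Double) -> Double? {
    guard let value = Double(raw) else { return nil }
    return (raw.contains(".") && value <= 1.0) ? value * extent : value
}

/// Returns the first point tag (if any) and the response with all tags removed.
func parsePointTag(in response: String, screenWidth: Double, screenHeight: Double)
    -> (point: PointTag?, spokenText: String) {
    guard let regex = try? NSRegularExpression(pattern: pointTagPattern,
                                               options: [.caseInsensitive]) else {
        return (nil, response)
    }
    let full = NSRange(response.startIndex..., in: response)

    // Capture groups that did not participate in the match yield nil.
    func capture(_ match: NSTextCheckingResult, _ index: Int) -> String? {
        Range(match.range(at: index), in: response).map { String(response[$0]) }
    }

    var point: PointTag?
    if let match = regex.firstMatch(in: response, options: [], range: full),
       let rawX = capture(match, 1), let rawY = capture(match, 2),
       let x = scaled(rawX, extent: screenWidth),
       let y = scaled(rawY, extent: screenHeight) {
        point = PointTag(x: x, y: y, label: capture(match, 3))
    }

    // Strip every tag so it is never read out loud.
    let spoken = regex
        .stringByReplacingMatches(in: response, options: [], range: full, withTemplate: "")
        .trimmingCharacters(in: .whitespacesAndNewlines)
    return (point, spoken)
}
```

With a 1440 by 900 screen, the tag `[POINT: 810, 0.074 : microsoft]` would resolve to x = 810 pixels and y = 0.074 × 900 pixels under this heuristic, while `[POINT:none]` yields no point and is still removed from the spoken text.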
- Problem: Screenshots were previously limited to a max dimension of `1280` pixels, which heavily blurred small text (like "Battery" or "Displays" in System Settings) on high-density Retina displays.
- Solution: Increased the capture limit to a `2048` max dimension, delivering crisper text to the vision model and reducing AI hallucination.
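The effect of the limit is easy to see in a small aspect-preserving resize calculation. This is only an arithmetic sketch; `captureSize` is a hypothetical helper, and how Echo actually configures its capture is not shown in this README.

```swift
import Foundation

/// Scales a source size down so its longest side is at most `maxDimension`,
/// preserving the aspect ratio. Sizes already within the limit are unchanged.
func captureSize(sourceWidth: Int, sourceHeight: Int,
                 maxDimension: Int = 2048) -> (width: Int, height: Int) {
    let longest = max(sourceWidth, sourceHeight)
    guard longest > maxDimension else { return (sourceWidth, sourceHeight) }
    let scale = Double(maxDimension) / Double(longest)
    return (Int((Double(sourceWidth) * scale).rounded()),
            Int((Double(sourceHeight) * scale).rounded()))
}
```

For a 3456 by 2234 pixel display, a 1280 limit keeps about 37% of the linear resolution, while a 2048 limit keeps about 59%, which is where the sharper small text comes from.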
- Problem: Vision coordinate regression is never 100% pixel-perfect (often off by 20 to 40 pixels).
- Solution: Built a multi-layered, dynamic snapping pipeline. To prevent Finder's background layout containers from hijacking desktop searches:
- If Finder is the active app, the snapping engine prioritizes the Finder Desktop Snap (AppleScript desktop grid lookup) first to guarantee pixel-perfect icon center locking.
- If any other app is active, it prioritizes the Accessibility Snap (`AXUIElement` window hierarchy traversal, filtering out dummy/container layout elements with coordinates near zero), falling back to Finder Desktop icons and then Vision coordinates.
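The priority order above can be expressed as an ordered list of snapping strategies that are tried until one succeeds. The sketch below is illustrative only: `ScreenPoint`, `SnapSource`, `Snapper`, and `resolveTarget` are hypothetical names, the two snappers are passed in as closures rather than calling real Accessibility or AppleScript APIs, and the order of the fallbacks when Finder is active (Accessibility after Finder Desktop) is an assumption.

```swift
import Foundation

struct ScreenPoint: Equatable { var x: Double; var y: Double }

enum SnapSource { case accessibility, finderDesktop, vision }

/// A snapper refines a rough point, or returns nil if it finds no target.
typealias Snapper = (ScreenPoint) -> ScreenPoint?

func resolveTarget(visionPoint: ScreenPoint,
                   frontmostAppIsFinder: Bool,
                   accessibilitySnap: @escaping Snapper,
                   finderDesktopSnap: @escaping Snapper)
    -> (point: ScreenPoint, source: SnapSource) {
    // Finder active: desktop icon lookup first. Otherwise: Accessibility first.
    let ordered: [(SnapSource, Snapper)] = frontmostAppIsFinder
        ? [(.finderDesktop, finderDesktopSnap), (.accessibility, accessibilitySnap)]
        : [(.accessibility, accessibilitySnap), (.finderDesktop, finderDesktopSnap)]

    for (source, snap) in ordered {
        if let snapped = snap(visionPoint) {
            return (snapped, source)
        }
    }
    // Last resort: use the raw coordinate from the vision model.
    return (visionPoint, .vision)
}
```

Keeping the strategies as closures makes the ordering testable without granting Accessibility or Automation permissions.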
- Problem: Self-learners and non-trained individuals in regions like India often communicate using a mix of Hindi and English (Hinglish) or pure Hindi. Standard speech engines defaulting to US English fail to transcribe these accents or mixed vocabularies, and standard AI agents reply in formal English, causing a cognitive disconnect for natural conversation.
- Solution:
- Re-architected `AppleSpeechTranscriptionProvider.swift` to dynamically prioritize `en-IN` (Indian English/Hinglish) and `hi-IN` (Hindi) locales at the top of the speech-recognition cascade, offering offline, low-latency native dictation for mixed Indian accents and dialects.
- Injected conversational multilingual directives into the companion's core Groq LLM system prompt (`CompanionManager.swift`). The companion now understands Hinglish queries (e.g., "KURSOR folder kaha hai?" meaning "Where is the KURSOR folder?", or "Settings me Battery option open karo" meaning "Open the Battery option in Settings") and naturally responds in casual Hinglish or Hindi, while still replying in English to English prompts.
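A locale cascade of this kind boils down to picking the first preferred locale that the recognizer supports. The following sketch assumes a hypothetical helper `pickRecognizerLocale`; the `en-US` fallback at the end of the list is an assumption, and on macOS the `supported` set would typically be built from the identifiers returned by `SFSpeechRecognizer.supportedLocales()`.

```swift
import Foundation

/// Returns the first preferred locale identifier found in `supported`.
/// Identifiers are compared after normalizing "en_IN" style to "en-IN".
func pickRecognizerLocale(preferred: [String] = ["en-IN", "hi-IN", "en-US"],
                          supported: Set<String>) -> String? {
    let normalized = Set(supported.map { $0.replacingOccurrences(of: "_", with: "-") })
    return preferred.first { normalized.contains($0) }
}
```

Passing the preference list as a parameter keeps the order easy to adjust for other regions.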
- Problem: Simply pointing with a generic cursor or blinking reticle is often overlooked on dense high-resolution Retina screens, leaving users searching for small folders or buttons.
- Solution:
- Added raw CGImage passing inside `CompanionScreenCaptureUtility.swift` when display screenshots are taken.
- Once the snapping overrides resolve the perfect coordinate, `CompanionManager` translates it back to screenshot pixels and crops a high-precision `160x160` square pixel slice of the screen buffer around the target center.
- Exposes the crop as a SwiftUI image inside `OverlayWindow.swift`, which is clipped to a circular lens, scaled up to 1.3x magnification, and styled with a metal rim glare reflection and a pulsing dotted focus border.
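The crop step can be sketched as a small geometry calculation: convert the resolved screen point to screenshot pixels, center a 160 by 160 square on it, and clamp the square to the image bounds. The names `PixelRect` and `lensCropRect` are hypothetical, as is the single `pointsToPixels` scale factor; the resulting rectangle could then be handed to something like `CGImage.cropping(to:)`.

```swift
import Foundation

struct PixelRect: Equatable {
    var x: Int
    var y: Int
    var width: Int
    var height: Int
}

/// Computes the lens crop in screenshot pixels.
/// Assumption: target coordinates and image rows share a top-left origin.
func lensCropRect(targetX: Double, targetY: Double,
                  pointsToPixels: Double,
                  imageWidth: Int, imageHeight: Int,
                  side: Int = 160) -> PixelRect {
    // Never ask for a crop larger than the image itself.
    let width = min(side, imageWidth)
    let height = min(side, imageHeight)

    // Target center in screenshot pixels.
    let centerX = Int((targetX * pointsToPixels).rounded())
    let centerY = Int((targetY * pointsToPixels).rounded())

    // Center the square on the target, then clamp it inside the image.
    let x = min(max(centerX - width / 2, 0), imageWidth - width)
    let y = min(max(centerY - height / 2, 0), imageHeight - height)
    return PixelRect(x: x, y: y, width: width, height: height)
}
```

Clamping means a target near a screen edge still produces a full-size crop, with the target slightly off-center inside the lens.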
ECHO/
├── README.md                                # Project vision, features, and setup
├── ARCHITECTURE.md                          # Deep-dive module breakdowns
├── build.sh                                 # Custom compilation & codesigning script
├── create_icns.sh                           # macOS AppIcon generator (portable)
├── echo/                                    # Core Swift Code
│   ├── echoApp.swift                        # Main Entry point & App delegate
│   ├── CompanionManager.swift               # Main state machine, API pipeline, and coordinate overrides
│   ├── OverlayWindow.swift                  # Bezier flight & BlueCursorView SwiftUI layout
│   ├── CompanionScreenCaptureUtility.swift  # SCKit screen capturer
│   ├── ElementLocationDetector.swift        # AX element traversal
│   ├── GroqAPI.swift                        # Vision API connection
│   ├── Assets.xcassets/                     # Icon assets
│   ├── Config.swift                         # Model config & Env loader
│   ├── DesignSystem.swift                   # Theme and typography styles
│   └── echo.entitlements                    # App privileges (Sandbox disabled for automation)
├── echoTests/                               # Tests
├── echoUITests/                             # UI Tests
└── worker/                                  # Auxiliary cloud resources
- A Mac running macOS 13.0 or later.
- Xcode Command Line Tools installed:
xcode-select --install
- Create a `.env` file in the `echo` subfolder (or let the app read your environment):
GROQ_API_KEY=your_groq_api_key
MODEL_NAME=meta-llama/llama-4-scout-17b-16e-instruct
- Build and sign the application from the project root:
./build.sh
- Run the application bundle:
open Echo.app
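For readers curious how the `.env` values from the setup steps might be consumed, here is a minimal loader sketch. It is not the code in `Config.swift`: `parseDotEnv` and `configValue` are hypothetical names, and the choice to let process environment variables take precedence over the file is an assumption.

```swift
import Foundation

/// Parses KEY=VALUE lines, skipping blanks and lines that start with "#".
func parseDotEnv(_ contents: String) -> [String: String] {
    var values: [String: String] = [:]
    for rawLine in contents.split(whereSeparator: \.isNewline) {
        let line = rawLine.trimmingCharacters(in: .whitespaces)
        guard !line.isEmpty, !line.hasPrefix("#"),
              let equals = line.firstIndex(of: "=") else { continue }
        let key = line[..<equals].trimmingCharacters(in: .whitespaces)
        var value = line[line.index(after: equals)...]
            .trimmingCharacters(in: .whitespaces)
        // Drop one pair of surrounding double quotes, if present.
        if value.count >= 2, value.hasPrefix("\""), value.hasSuffix("\"") {
            value = String(value.dropFirst().dropLast())
        }
        if !key.isEmpty { values[key] = value }
    }
    return values
}

/// Assumed precedence: the process environment wins over the .env file.
func configValue(_ key: String, dotEnv: [String: String],
                 environment: [String: String] = ProcessInfo.processInfo.environment) -> String? {
    environment[key] ?? dotEnv[key]
}
```

With this shape, exporting `GROQ_API_KEY` in the shell would override the value stored in the file without editing it.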
When you launch Echo for the first time, a polished system status checklist will appear. Make sure to grant these essential system permissions:
- Accessibility: For mouse tracking and window automation.
- Screen Recording: To allow the vision AI to see your screen context.
- Microphone & Speech Recognition: For push-to-talk recording and offline transcription.
- Desktop Access: To allow Echo to open local paths for you.
- Trigger: Press and hold `Control + Option`. The blue companion will transform into a glowing soundwave, listening to you.
- Instruction: Speak your query while holding the keys (e.g. "point to the Battery option in my settings" or "open the folder KURSOR").
- Release: Release the keys. The soundwave will spin while processing, then swoop to point cleanly to your target!
This project is licensed under the MIT License. See the LICENSE file for details.
Aditya Gupta
AI Systems Engineer & Swift Developer
If you find this project inspiring or helpful, consider leaving a ⭐ on the repository!