A fully functional autonomous AI Agent for Windows 11 that uses Ollama with Qwen3-VL to:
- See: Capture screenshots in real-time
- Think: Analyze visual state with vision-language model
- Act: Execute computer actions autonomously
- Loop: Continue until task completion
```
STC/
├── 📄 Core Agent Files
│   ├── see_think_act_agent.py       # Main agent implementation
│   ├── config.py                    # Configuration settings
│   └── examples.py                  # Example tasks
│
├── 🛠️ Utility Modules
│   └── utils/
│       ├── __init__.py
│       ├── screenshot_capture.py    # Screen capture (mss)
│       ├── ollama_client.py         # Ollama API wrapper
│       ├── action_executor.py       # Action execution (pyautogui)
│       └── agent_function_call.py   # Function calling framework
│
├── 📓 Documentation
│   ├── README.md                    # Full documentation
│   ├── QUICKSTART.md                # Quick start guide
│   └── see_think_act_demo.ipynb     # Interactive demo notebook
│
├── ⚙️ Setup Files
│   ├── requirements.txt             # Python dependencies
│   ├── setup.ps1                    # Automated setup script
│   └── .gitignore                   # Git ignore rules
│
└── 📁 Generated Directories (auto-created)
    ├── screenshots/                 # Default screenshots
    ├── agent_screenshots/           # Agent execution screenshots
    ├── logs/                        # Log files
    └── model_responses/             # Model response history
```
**Screenshot Capture (`utils/screenshot_capture.py`)**
- ✅ Fast screen capture using the `mss` library
- ✅ Multi-monitor support
- ✅ Base64 encoding for API transmission
- ✅ Configurable format and quality
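The base64 step can be sketched without `mss` itself: given raw PNG bytes from any capture backend, the API expects a plain base64 string. A minimal sketch (the function names here are illustrative, not the module's actual API):

```python
import base64

def encode_image(png_bytes: bytes) -> str:
    """Encode raw PNG bytes as a base64 string for API transmission."""
    return base64.b64encode(png_bytes).decode("ascii")

def decode_image(b64: str) -> bytes:
    """Reverse the encoding, e.g. when replaying saved screenshots."""
    return base64.b64decode(b64)
```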
**Ollama Client (`utils/ollama_client.py`)**
- ✅ Connects to Ollama API
- ✅ Handles image encoding
- ✅ Supports function calling
- ✅ Parses computer-use actions
- ✅ Connection testing utility
**Action Executor (`utils/action_executor.py`)**
- ✅ Mouse control (click, move, drag)
- ✅ Keyboard control (type, press keys, hotkeys)
- ✅ Scrolling support
- ✅ Normalized coordinate system (0-1000 scale)
- ✅ Configurable timing and delays
- ✅ Failsafe mechanisms
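The 0-1000 normalized scale means the model reasons about positions independently of screen resolution. A small helper illustrates the mapping (a sketch; the executor's actual function names may differ):

```python
def normalized_to_pixels(x_norm: float, y_norm: float,
                         screen_width: int = 1920,
                         screen_height: int = 1080) -> tuple:
    """Map model coordinates on the 0-1000 scale to absolute pixels."""
    x_px = round(x_norm / 1000 * screen_width)
    y_px = round(y_norm / 1000 * screen_height)
    return x_px, y_px
```

For example, the screen center (500, 500) maps to (960, 540) on a 1920x1080 display, and the same normalized point stays centered on any resolution.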
**Main Agent (`see_think_act_agent.py`)**
- ✅ Complete See-Think-Act loop
- ✅ Task execution with iteration limit
- ✅ Screenshot history saving
- ✅ Comprehensive logging
- ✅ Error handling and recovery
- ✅ Task status tracking
- ✅ Conversation history
**Configuration (`config.py`)**
- ✅ Centralized settings
- ✅ Easy customization
- ✅ Model parameters
- ✅ Timing controls
- ✅ Safety settings
**Examples (`examples.py`)**
- ✅ 5+ example tasks
- ✅ Interactive menu
- ✅ Sequential execution option
- ✅ Customizable tasks
**Documentation**
- ✅ Comprehensive README
- ✅ Quick start guide
- ✅ Interactive Jupyter notebook
- ✅ Installation instructions
- ✅ Troubleshooting tips
**Setup**
- ✅ PowerShell setup script
- ✅ Requirements file
- ✅ Automated dependency installation
- ✅ Connection testing
1. Install Ollama and pull the model:

   ```
   ollama pull qwen3-vl:235b-cloud
   ```

2. Install the dependencies:

   ```
   pip install -r requirements.txt
   ```

3. Run the agent:

   ```
   python see_think_act_agent.py
   ```
Alternatively, run the automated setup script:

```powershell
.\setup.ps1
```

```python
from see_think_act_agent import SeeThinkActAgent

# Initialize
agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",
    max_iterations=30,
    save_screenshots=True
)

# Run a task
result = agent.run("Open Notepad and type 'Hello World!'")
print(result)
```

```
# Run examples
python examples.py

# Run main agent
python see_think_act_agent.py
```

Or explore the interactive notebook:

```
jupyter notebook see_think_act_demo.ipynb
```

| Component | Technology | Purpose |
|---|---|---|
| Vision Model | Qwen3-VL | Image understanding & reasoning |
| LLM Runtime | Ollama | Local model inference |
| Screen Capture | mss | Fast screenshot capture |
| GUI Control | PyAutoGUI | Mouse & keyboard automation |
| Image Processing | Pillow | Image manipulation |
| Notebook | Jupyter | Interactive demos |
| Function Calling | qwen-agent | Structured tool use |
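Under the hood, the client's request follows Ollama's standard `/api/chat` format, with base64-encoded screenshots attached to the user message. A sketch of the payload construction (helper name is illustrative; error handling and streaming omitted):

```python
def build_chat_payload(model: str, prompt: str, image_b64: str) -> dict:
    """Build an Ollama /api/chat request body with one attached screenshot."""
    return {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt, "images": [image_b64]}
        ],
        "stream": False,  # get one complete response per action decision
    }

# Sending it is a single POST, e.g. with requests:
# requests.post("http://localhost:11434/api/chat",
#               json=build_chat_payload("qwen3-vl:235b-cloud",
#                                       "What should I click next?", b64))
```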
The agent can perform these actions:
- ✅ Mouse: Left/right/middle click, double-click, move, drag
- ✅ Keyboard: Type text, press keys, hotkey combinations
- ✅ Scroll: Vertical scrolling
- ✅ Wait: Pause for UI updates
- ✅ Terminate: Mark task as complete
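Dispatching a parsed action onto PyAutoGUI can be sketched as a simple lookup. The action schema below is illustrative of the kind of dict the model's response parses into, not the project's exact format; the backend is injected so the sketch can be exercised without a real GUI:

```python
from typing import Any, Callable

def make_dispatcher(backend: Any) -> Callable[[dict], None]:
    """Map parsed action dicts onto a PyAutoGUI-like backend.

    `backend` must expose click/write/press/hotkey/scroll, matching
    pyautogui's function names, so tests can pass a recording stub.
    """
    def dispatch(action: dict) -> None:
        name = action["action"]
        if name == "click":
            backend.click(action["x"], action["y"])
        elif name == "type":
            backend.write(action["text"])
        elif name == "key":
            backend.press(action["key"])
        elif name == "hotkey":
            backend.hotkey(*action["keys"])
        elif name == "scroll":
            backend.scroll(action["amount"])
        else:
            raise ValueError(f"unknown action: {name}")
    return dispatch
```

In production the backend would simply be the `pyautogui` module itself.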
Safety features:
- ✅ Maximum iteration limit (prevents infinite loops)
- ✅ Failsafe: move mouse to a screen corner to abort
- ✅ Keyboard interrupt (Ctrl+C)
- ✅ Screenshot history for review
- ✅ Comprehensive logging
- ✅ Configurable timeouts
```
┌─────────────────────────────────────────────┐
│ 1. User provides task                       │
└──────────────┬──────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────┐
│ 2. LOOP: While not complete                 │
│   ┌─────────────────────────────────┐       │
│   │ a. SEE - Capture screenshot     │       │
│   └──────────────┬──────────────────┘       │
│                  ▼                          │
│   ┌─────────────────────────────────┐       │
│   │ b. THINK - Analyze with model   │       │
│   │   • Send screenshot to Qwen3-VL │       │
│   │   • Get action decision         │       │
│   └──────────────┬──────────────────┘       │
│                  ▼                          │
│   ┌─────────────────────────────────┐       │
│   │ c. ACT - Execute action         │       │
│   │   • Mouse click / keyboard      │       │
│   │   • Wait for UI update          │       │
│   └──────────────┬──────────────────┘       │
│                  │                          │
│   ┌──────────────▼──────────────────┐       │
│   │ d. Check if complete            │       │
│   │   • Task done?                  │       │
│   │   • Max iterations?             │       │
│   └─────────────────────────────────┘       │
└─────────────────────────────────────────────┘
               │
               ▼
┌─────────────────────────────────────────────┐
│ 3. Return result                            │
│   • Success/failure status                  │
│   • Number of iterations                    │
│   • Elapsed time                            │
└─────────────────────────────────────────────┘
```
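The diagram above condenses into a few lines of Python. Here `see`, `think`, and `act` stand in for the real screenshot, model, and executor calls (a sketch of the control flow, not the agent's actual method signatures):

```python
import time

def run_loop(see, think, act, max_iterations: int = 30) -> dict:
    """Minimal See-Think-Act loop with an iteration cap."""
    start = time.time()
    for i in range(1, max_iterations + 1):
        screenshot = see()             # SEE: capture current screen state
        action = think(screenshot)     # THINK: ask the model for the next action
        if action.get("action") == "terminate":
            return {"success": True, "iterations": i,
                    "elapsed": time.time() - start}
        act(action)                    # ACT: execute, then loop again
    return {"success": False, "iterations": max_iterations,
            "elapsed": time.time() - start}
```

If the model never emits a `terminate` action, the iteration cap ends the run with a failure status instead of looping forever.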
Example tasks the agent can handle:
- Open applications (Notepad, Calculator, etc.)
- Type text
- Perform calculations
- Navigate File Explorer
- Web browsing and searching
- File management operations
- Multi-step workflows
- Application interaction
Troubleshooting:
- **Model not found:** run `ollama pull qwen3-vl:235b-cloud`
- **Cannot connect to Ollama:** ensure Ollama is running (`ollama serve`)
- **Clicks land in the wrong place:** check the screen resolution in agent initialization
- **Actions run too fast or too slow:** adjust `PYAUTOGUI_PAUSE` in `config.py`
Typical timings per loop stage:
- Screenshot capture: ~50-100 ms
- Model inference: ~2-5 seconds per action
- Action execution: ~50-500 ms
- Total per iteration: ~3-6 seconds
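Per-stage timings like these can be measured with a tiny wrapper around `time.perf_counter` (a sketch for profiling your own runs, not part of the codebase):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_milliseconds)."""
    t0 = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, (time.perf_counter() - t0) * 1000.0

# e.g. screenshot, ms = timed(agent.capture_screenshot)
```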
Potential improvements:
- Multi-monitor support
- Action replay/recording
- Task planning optimization
- Better error recovery
- Voice control integration
- Mobile device support
- Cloud model options
Before first use:
- Python installed
- Ollama installed
- Model downloaded
- Dependencies installed
- Connection test passed
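The dependency items on this checklist can be verified programmatically. A small preflight sketch (the module list is assumed from `requirements.txt` and may not match it exactly):

```python
import importlib.util

def missing_dependencies(modules=("mss", "pyautogui", "PIL", "requests")):
    """Return the subset of required modules that are not importable."""
    return [m for m in modules if importlib.util.find_spec(m) is None]

# e.g. print(missing_dependencies())  # [] means all dependencies are installed
```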
- Ollama: https://ollama.ai
- Qwen3-VL: https://github.com/QwenLM
- PyAutoGUI: https://pyautogui.readthedocs.io
- MSS: https://python-mss.readthedocs.io
If you encounter issues:
- Check `README.md` for detailed documentation
- Review `QUICKSTART.md` for setup help
- Examine logs in the `logs/` directory
- Check saved screenshots in `agent_screenshots/`
- Run the connection test:

```
python -c "from utils.ollama_client import OllamaVisionClient; OllamaVisionClient().test_connection()"
```
Everything is set up and ready to go. Start with:
- Run `.\setup.ps1` to verify the setup
- Try `python examples.py` for guided examples
- Open `see_think_act_demo.ipynb` for interactive learning
Have fun with your autonomous AI agent! 🚀