Before starting, ensure you have:
- Windows 11 (or Windows 10)
- Python 3.9 or higher
- Ollama installed
- Internet connection (for model download)
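The Python and Ollama prerequisites can be checked from a script before going further. The following is a minimal sketch; the helper names `python_ok` and `ollama_on_path` are illustrative and not part of the project.

```python
import shutil
import sys


def python_ok(version_info=sys.version_info, minimum=(3, 9)):
    """Return True when the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= tuple(minimum)


def ollama_on_path():
    """Return True when an `ollama` executable can be found on PATH."""
    return shutil.which("ollama") is not None


if __name__ == "__main__":
    print("Python 3.9+:", python_ok())
    print("Ollama on PATH:", ollama_on_path())
```

If `ollama_on_path()` reports False right after installing, opening a new PowerShell window usually refreshes PATH.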
Download and install from: https://ollama.ai
Or use winget:
winget install Ollama.Ollama
Open PowerShell and run:
ollama pull qwen3-vl:235b-cloud
This will download the Qwen3-VL vision model (~13GB). Wait for it to complete.
In the project directory, run:
pip install -r requirements.txt
Or use the automated setup script:
.\setup.ps1
Run the test:
python -c "from utils.ollama_client import OllamaVisionClient; client = OllamaVisionClient(); client.test_connection()"
You should see: ✓ Model 'qwen3-vl:235b-cloud' is available
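The same check can be approximated against Ollama's local HTTP API, which lists installed models at `/api/tags`. This sketch assumes Ollama is listening on its default address, http://localhost:11434. The function names are illustrative, and this is not necessarily how `OllamaVisionClient.test_connection()` works internally.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama address


def model_names(tags_payload):
    """Extract model names from a parsed /api/tags response."""
    return [m.get("name", "") for m in tags_payload.get("models", [])]


def model_is_available(tags_payload, model):
    """Return True when `model` appears in the /api/tags listing."""
    return model in model_names(tags_payload)


def fetch_tags(base_url=OLLAMA_URL, timeout=5):
    """Ask the local Ollama server which models it has."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    try:
        tags = fetch_tags()
    except OSError as exc:  # URLError is a subclass of OSError
        print(f"Could not reach Ollama: {exc}")
    else:
        print(model_is_available(tags, "qwen3-vl:235b-cloud"))
```

A connection error here points at Ollama not running, while a False result points at the model not having been pulled.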
Run the agent:
python see_think_act_agent.py
Or for interactive examples:
jupyter notebook see_think_act_demo.ipynb
Try this simple example:
from see_think_act_agent import SeeThinkActAgent
# Initialize the agent
agent = SeeThinkActAgent(
    model="qwen3-vl:235b-cloud",
    max_iterations=20,
    save_screenshots=True
)
# Run a simple task
result = agent.run("Open Notepad and type 'Hello from AI!'")
print(result)
- Agent starts: You'll see log messages indicating the agent is starting
- Screenshot capture: The agent takes a screenshot of your desktop
- Thinking: The model analyzes the screenshot (takes a few seconds)
- Action: The agent performs an action (e.g., clicking, typing)
- Repeat: The screenshot, thinking, and action steps repeat until the task is complete
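The cycle described above can be sketched as a plain loop. This is an illustration only, not the project's implementation: `capture`, `decide`, and `execute` are hypothetical callables standing in for the screenshot, model, and action-executor components, and the `"done"` action name is an assumption.

```python
def see_think_act(task, capture, decide, execute, max_iterations=20):
    """Run a capture -> decide -> execute loop until done or out of iterations."""
    for iteration in range(1, max_iterations + 1):
        screenshot = capture()                 # see
        action = decide(task, screenshot)      # think
        if action.get("action") == "done":     # hypothetical completion signal
            return {"status": "success", "iterations": iteration}
        execute(action)                        # act
    return {"status": "max_iterations_reached", "iterations": max_iterations}


if __name__ == "__main__":
    # Stubbed components: the "model" types once, then reports completion.
    scripted = iter([{"action": "type", "text": "notepad"}, {"action": "done"}])
    print(see_think_act("Open Notepad", lambda: b"", lambda t, s: next(scripted), print))
```

The iteration cap is what keeps a confused model from acting on the desktop indefinitely, which is presumably the role of `max_iterations` in the agent.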
- Console logs: Watch the terminal for real-time updates
- Screenshots: Check the agent_screenshots/ folder to see what the agent saw
- Stop anytime: Press Ctrl+C or move the mouse to the top-left corner
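In Python, Ctrl+C surfaces as `KeyboardInterrupt`, and PyAutoGUI's fail-safe raises `pyautogui.FailSafeException` when it triggers. A minimal sketch of turning either into a clean stop follows. `run_with_stop_handling` is a hypothetical wrapper, and whether the agent already handles these cases internally is not stated here.

```python
def run_with_stop_handling(run_task, task, stop_exceptions=(KeyboardInterrupt,)):
    """Call run_task(task) and turn a user-initiated stop into a status dict."""
    try:
        return {"status": "finished", "result": run_task(task)}
    except stop_exceptions as exc:
        # Report which kind of stop occurred instead of printing a traceback.
        return {"status": "stopped", "reason": type(exc).__name__}
```

With PyAutoGUI installed, passing `stop_exceptions=(KeyboardInterrupt, pyautogui.FailSafeException)` and `run_task=agent.run` would cover both ways of stopping.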
agent.run("Open Calculator")
agent.run("Open Calculator and calculate 42 + 17")
agent.run("Open Notepad and type 'The AI agent is working!'")
agent.run("Open Microsoft Edge")
- Run: ollama pull qwen3-vl:235b-cloud
- Verify: ollama list
- Make sure Ollama is running
- Restart Ollama if needed
- Check your screen resolution matches agent settings
- Adjust coordinates in action_executor.py if needed
- Adjust pyautogui.PAUSE in action_executor.py
- Modify wait times in the agent
- Start with simple, safe tasks
- Keep important work saved
- Don't leave the agent unattended on complex tasks
- Use a test environment first
- Review the code to understand what it does
Once comfortable with basic tasks:
- Try complex tasks: Multi-step workflows
- Customize the agent: Adjust parameters and behavior
- Add new actions: Extend the action executor
- Create workflows: Chain multiple tasks together
- Review screenshots: Learn how the agent "sees"
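Chaining tasks into a workflow can be as simple as running them in order and stopping at the first failure. In the sketch below, `run_task` is any callable that takes a task string, such as `agent.run`. The assumption that it returns a mapping with a `status` key is based only on the `Status: success` line in the sample output and may not match the real return value.

```python
def run_workflow(run_task, tasks):
    """Run tasks in order; stop at the first one that does not report success."""
    results = []
    for task in tasks:
        result = run_task(task)
        results.append((task, result))
        if result.get("status") != "success":  # assumed result shape
            break
    return results
```

Stopping early matters for desktop automation because later steps usually depend on the window state left behind by earlier ones.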
- Read the full README.md for detailed documentation
- Check see_think_act_demo.ipynb for examples
- Review saved screenshots to debug issues
- Check console logs for error messages
$ python see_think_act_agent.py
================================================================================
Starting task: Open Notepad and type 'Hello from AI!'
================================================================================
Iteration 1/30
================================================================================
Capturing screenshot...
Screenshot saved: agent_screenshots/screenshot_001_20250101_120000.png
Thinking and deciding next action...
Model response: {action: "left_click", coordinate: [50, 950]}
Executing action: left_click at (96, 1026)
Iteration 2/30
================================================================================
Capturing screenshot...
Thinking and deciding next action...
Model response: {action: "type", text: "notepad"}
Executing action: type 'notepad'
...
================================================================================
TASK COMPLETED: Task completed in 8 iterations
Status: success
Time: 45.23 seconds
================================================================================
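In the output above, the model's coordinate [50, 950] is executed as a click at (96, 1026). Those numbers are consistent with the model answering on a 0-1000 grid that is then scaled to a 1920x1080 screen, although the exact conversion in action_executor.py is not shown here. A minimal sketch under that assumption (`scale_to_screen` is a hypothetical helper):

```python
def scale_to_screen(coordinate, screen_width, screen_height, grid=1000):
    """Map an (x, y) point on a 0..grid square onto screen pixels."""
    x, y = coordinate
    return (round(x * screen_width / grid), round(y * screen_height / grid))


print(scale_to_screen([50, 950], 1920, 1080))  # (96, 1026)
```

If clicks land in the wrong place, a mismatch between the screen size used for scaling and the actual resolution would produce this kind of offset.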
Enjoy using your See-Think-Act AI Agent! 🤖