Interactive AI browser automation using Browser Use and Microsoft Foundry. Natural language browser agent with human-in-the-loop intervention points for authentication, verification, and confirmation. DOM-first agent powered by OpenAI models on Microsoft Foundry, controlling Chromium directly via CDP (Chrome DevTools Protocol) for web task automation.
Read the accompanying blog post: Building a Browser Agent with Microsoft Foundry and Browser Use
Microsoft Foundry provides managed OpenAI model hosting on Azure with enterprise features like quota management, regional deployment, and integration with Azure's identity and networking stack.
Requires Azure setup first. You'll need a Microsoft Foundry deployment before running the agent - see Setup below.
uv run python browse.py
Browse - AI browser automation
What would you like to do?
> Check bbc.co.uk for the top 5 news stories and list them out
The agent navigates, searches, and extracts information, pausing when it needs your help with authentication, verification, or important decisions.
browse.py provides an interactive CLI for natural language browser automation. The agent works autonomously, showing real-time progress updates:
Step 1/25: Opening bbc.co.uk (1.2s)
Step 2/25: Navigating to News section (0.8s)
Step 3/25: Extracting top headlines (0.6s)
The agent pauses for human input only when necessary (authentication, CAPTCHA, ambiguous choices, or confirmation before actions like purchases). When the task completes, you'll see structured results and options to continue or exit.
During agent execution, a persistent footer pinned to the bottom of the terminal shows available shortcuts. B and Q respond immediately without waiting for the current agent step to finish:
| Key | Action |
|---|---|
| B | Toggle browser window visibility (hidden by default) |
| V | Toggle verbose mode on/off |
| F | Toggle vision mode (screenshot analysis for visually complex pages) |
| I | Send new instructions to the agent |
| P | Pause/resume agent execution |
| Q | Quit |
The browser starts hidden and auto-shows when authentication or CAPTCHA is needed. On macOS, hide/show uses osascript for instant response; other platforms use CDP window bounds.
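The platform split described above could be sketched roughly as follows. This is an illustration only, not the code in keyboard.py: the function name, the "Chromium" process name, and the default window id are assumptions. Browser.setWindowBounds is a standard CDP method; the function only builds the request and does not send it.

```python
import sys


def visibility_request(show: bool, platform: str = sys.platform, window_id: int = 1) -> dict:
    """Describe how a hide/show request might be dispatched (hypothetical helper)."""
    if platform == "darwin":
        # macOS: ask System Events to hide or show the browser process.
        # The process name "Chromium" is an assumption.
        state = "true" if show else "false"
        script = (
            'tell application "System Events" '
            f'to set visible of process "Chromium" to {state}'
        )
        return {"via": "osascript", "argv": ["osascript", "-e", script]}
    # Other platforms: change the window state through CDP window bounds.
    return {
        "via": "cdp",
        "method": "Browser.setWindowBounds",
        "params": {
            "windowId": window_id,  # assumed; normally obtained from the browser
            "bounds": {"windowState": "normal" if show else "minimized"},
        },
    }
```

On macOS the argv list could be passed to subprocess.run; elsewhere the method and params would go over the existing CDP connection.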
The agent requests human input in these situations:
Security and verification:
- Authentication required - Login pages or session timeouts (you log in manually in the browser)
- CAPTCHA/verification - Human verification challenges (you solve the CAPTCHA)
Decision making:
- Ambiguous choice - Multiple valid options where the agent cannot determine which to choose (you select from a numbered list)
- Confirmation before action - Destructive or significant actions like purchases, form submissions, or deletions (you confirm or cancel)
- Confidence check - When the agent is uncertain about a finding or next step, it pauses and asks you to decide
Progress management:
- Agent stuck - After 3 consecutive failures, the agent asks if you want to retry, get a page description, provide new instructions, or abort
- Approaching max steps - At 80% of the step limit, you can increase the limit, wrap up, or stop
- Progress checkpoint - At major phase transitions (e.g. searching to comparing), the agent shows a brief progress summary and asks whether to continue or adjust
Planning:
- Batched upfront questions - Before starting, the agent analyses your task for likely ambiguities and asks 0-3 clarifying questions up front to minimise interruptions
- Sub-goal summary - When a sub-goal completes, the agent shows a mini-summary and asks "what next?"
The terminal bell rings when the agent needs your attention (works with iTerm2 and most terminals).
See docs/interaction-spec.md for the complete UX specification.
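As a rough illustration of the two numeric triggers listed above (3 consecutive failures, 80% of the step limit), a tracker might look like the sketch below. The class and the returned labels are hypothetical and are not taken from intervention.py.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CONSECUTIVE_FAILURES = 3  # "Agent stuck" threshold
STEP_WARNING_PERCENT = 80     # "Approaching max steps" threshold


@dataclass
class ProgressTracker:
    """Hypothetical tracker for the progress-management triggers."""

    max_steps: int = 25
    step: int = 0
    consecutive_failures: int = 0
    warned: bool = False

    def record(self, succeeded: bool) -> Optional[str]:
        """Record one step and return an intervention label if one is due."""
        self.step += 1
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        if self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
            self.consecutive_failures = 0  # start counting again after asking
            return "agent_stuck"
        # Integer comparison avoids floating-point rounding at the boundary.
        if not self.warned and self.step * 100 >= STEP_WARNING_PERCENT * self.max_steps:
            self.warned = True  # warn once per task
            return "approaching_max_steps"
        return None
```

With the default limit of 25 steps, the step-limit prompt in this sketch would fire at step 20.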
Sessions maintain context across tasks - follow-up tasks understand what happened before. After each task, the agent generates a concise summary that feeds into the next task's prompt, so you can build on prior results naturally.
> Find the top 3 wireless keyboards under £50 on Amazon UK
✓ Task completed in 9 steps (16.3s)
Found 3 wireless keyboards:
1. Logitech K380 - £29.99
2. Anker A7726 - £25.99
3. iClever BK10 - £33.99
Session log saved to: logs/browse-session-20260216-143022.md
What would you like to do next?
1. New task (with session context)
2. View session log
3. Exit
Choose [3]: 1
What would you like to do?
> Compare those top 2 and tell me which has better reviews
The agent remembers finding those keyboards and knows which two you mean.
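A minimal sketch of how task summaries could feed into the next prompt, assuming a simple list of summaries. The class, method names, and prompt wording are illustrative and may differ from session.py.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical session holding one summary per completed task."""

    summaries: list[str] = field(default_factory=list)

    def add_summary(self, summary: str) -> None:
        self.summaries.append(summary.strip())

    def build_prompt(self, task: str) -> str:
        """Prefix the new task with earlier summaries, if there are any."""
        if not self.summaries:
            return task
        history = "\n".join(f"- {s}" for s in self.summaries)
        return f"Previous tasks in this session:\n{history}\n\nNew task: {task}"
```

In the keyboard example above, the summary of the first task would give the model enough context to resolve "those top 2" in the follow-up.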
- Default - Findings-only summary after each task
- Verbose - Full action log plus findings. Enable with uv run python browse.py --verbose or opt in at session start
- JSON - Structured output for piping to other tools:
uv run python run_task.py --json "your task"
Session logs are auto-saved to logs/ after each task completes. The path is shown in dim text below the results. Choose "View session log" from the completion menu to display the full summary. Exports include all task summaries, structured data, and a suggested follow-up prompt formatted for Claude Code.
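The log filename shown earlier (logs/browse-session-20260216-143022.md) looks like a timestamp in YYYYMMDD-HHMMSS form. Assuming that is the pattern, a path could be built like this; the function name is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def session_log_path(started: datetime, log_dir: str = "logs") -> Path:
    """Build a log path in the style logs/browse-session-YYYYMMDD-HHMMSS.md."""
    return Path(log_dir) / f"browse-session-{started:%Y%m%d-%H%M%S}.md"
```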
- Python 3.12+
- UV package manager (curl -LsSf https://astral.sh/uv/install.sh | sh)
- Azure CLI installed and logged in (az login)
- Azure subscription with access to deploy Microsoft Foundry (AI Services) resources
- Platform: Developed and tested on macOS with iTerm2. Linux should work with minor adjustments. Windows users will need WSL or Git Bash for the deployment scripts. Terminal bell notifications work best with iTerm2.
Cost note: Running tasks costs roughly $0.01-0.05 per simple task using GPT-4.1-mini. See Cost and capacity for details.
- Clone the repository:
git clone https://github.com/Sealjay/foundry-browser-use.git
cd foundry-browser-use
- Install Python dependencies:
uv sync
- Install Chromium (used by Browser Use via CDP):
uv run playwright install chromium
Deploy a Microsoft Foundry resource to rg-browser-agent in uksouth:
Option A: Azure CLI deployment
./infra/deploy.sh rg-browser-agent uksouth oai-foundry-browser
Option B: Bicep deployment
./infra/deploy-bicep.sh rg-browser-agent
Both scripts will output environment variable values - copy these to your .env file. When you're done, remember to tear down these resources to avoid ongoing charges.
Copy .env.example to .env and fill in your Microsoft Foundry credentials from the deployment output:
cp .env.example .env
# Edit .env with your Foundry endpoint, API key, and deployment name
Security note: Never commit .env to version control (already in .gitignore). On shared machines, restrict permissions: chmod 600 .env.
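Before starting the agent it can help to check that the settings are present. The sketch below assumes the values from .env have already been loaded into the process environment. Only AZURE_OPENAI_DEPLOYMENT_NAME is named in this README; the other two variable names are assumptions, so check .env.example for the real ones.

```python
import os
from typing import Mapping, Optional

# Only the deployment name appears in this README; the endpoint and key
# variable names are assumed for illustration.
REQUIRED = (
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_DEPLOYMENT_NAME",
)


def missing_settings(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name, "").strip()]
```

An empty list means every required value is set; otherwise the returned names can be printed before exiting.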
The interactive CLI (browse.py) is the recommended way to run browser automation tasks. For scripting or non-interactive use, you can also:
Run the example agent:
uv run python agent.py
Run a custom task programmatically:
uv run python run_task.py "go to google.com and search for cats"
The agent can be configured with the following parameters:
- use_vision: Set to False (default) for DOM-only mode, or True to include screenshots. DOM-only mode is significantly faster and more cost-effective.
- max_steps: Maximum number of steps the agent can take (default: 25 in interactive mode, 10 in scripting mode)
- model: Swap between different OpenAI models on Microsoft Foundry by changing AZURE_OPENAI_DEPLOYMENT_NAME in .env
When using the interactive CLI (browse.py), the agent starts with 25 steps and offers to increase the limit if needed.
See Microsoft Foundry model documentation for available models and deployment guidance.
Important: The browser-use library sends anonymous usage data to PostHog by default. This project disables telemetry in .env.example because browser automation tasks can involve sensitive sites and workflows. Ensure ANONYMIZED_TELEMETRY=false is set in your .env file (already included in .env.example).
Pricing is indicative only and subject to change. For current rates, see the Azure Pricing Calculator. As a rough guide (as of February 2025), GPT-4.1-mini costs approximately $0.40 per 1M input tokens and $1.60 per 1M output tokens. Actual costs vary based on task complexity, context size, and Azure region.
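For a back-of-the-envelope estimate, the indicative rates above can be turned into a small calculation. The rates are the ones quoted in this section and may be out of date; the function name is illustrative.

```python
# Indicative GPT-4.1-mini rates quoted above, in USD per 1M tokens.
INPUT_USD_PER_MILLION = 0.40
OUTPUT_USD_PER_MILLION = 1.60


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one task from its token counts."""
    total = input_tokens * INPUT_USD_PER_MILLION + output_tokens * OUTPUT_USD_PER_MILLION
    return total / 1_000_000
```

For example, a task that sends 50,000 input tokens and receives 2,000 output tokens (illustrative figures) comes to roughly $0.023, which sits inside the $0.01-0.05 range quoted for simple tasks.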
Tear down when idle: Run ./infra/teardown.sh rg-browser-agent when you're finished to stop all charges. The agent can consume quota quickly during multi-step tasks.
The deployment scripts default to 150K TPM (GlobalStandard SKU). Browser automation is token-heavy - each step sends DOM content plus conversation history, so a single task can consume 10-50K tokens. With multi-turn sessions the context grows further. If you hit rate limits (HTTP 429), increase your TPM allocation in the Azure portal or via the deployment scripts. The Azure CLI deployment accepts --sku-capacity and the Bicep template has a skuCapacity parameter.
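To get a feel for how far the default allocation stretches, the figures above give a rough ceiling on tasks per minute. This is simple arithmetic under the assumption that a whole task's tokens land within one minute; it is not a model of how Azure enforces rate limits.

```python
def tasks_per_minute(tpm_limit: int, tokens_per_task: int) -> int:
    """Rough upper bound on whole tasks that fit in one minute of TPM quota."""
    if tokens_per_task <= 0:
        raise ValueError("tokens_per_task must be positive")
    return tpm_limit // tokens_per_task
```

At the default 150K TPM, tasks of 50K tokens give a ceiling of 3 per minute and tasks of 10K tokens a ceiling of 15, so a few heavy tasks in quick succession can be enough to trigger HTTP 429.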
To delete all Azure resources and avoid charges:
./infra/teardown.sh rg-browser-agent
Warning: This permanently deletes the resource group and all resources within it.
foundry-browser-use/
  browse.py              # Interactive CLI entry point
  agent.py               # Demo agent (scripting)
  run_task.py            # One-shot task runner (scripting)
  browser_agent/         # Interactive CLI package
    cli.py               # CLI orchestrator
    runner.py            # Agent execution wrapper
    intervention.py      # Human intervention handlers
    keyboard.py          # Keyboard shortcuts, agent state, persistent footer
    display.py           # Result formatting
    session.py           # Multi-turn session context
  infra/                 # Azure deployment scripts
    deploy.sh            # Azure CLI deployment
    deploy-bicep.sh      # Bicep deployment
    main.bicep           # Bicep template
    teardown.sh          # Resource cleanup
  docs/
    interaction-spec.md  # UX specification
- DOM-only mode works best on content-heavy and form-based sites. Modern SPAs with heavily obfuscated DOMs, Shadow DOM, or Canvas/WebGL rendering may not parse well. Press F during execution to toggle vision mode on, or set use_vision=True in code for visually complex pages (costs more tokens).
- Anti-bot measures: Many websites detect and block browser automation. The agent pauses for CAPTCHAs, but persistent blocking or account flagging is possible. Respect target sites' terms of service.
- Platform: Built and tested on macOS. Shell scripts, terminal features (bell, persistent footer), and browser window management (osascript on macOS, CDP on others) may behave differently on Linux or Windows/WSL.
Contributions are welcome via pull request.
This project is licensed under the MIT Licence - see the LICENCE file for details.