llm-driven-web-agent
Control web environments programmatically through natural-language prompts. An integrated language model interface handles tasks such as site navigation, form submission, and element interaction.
Author: pietrozullo
Language Model-Orchestrated Web Interface Control
A FastMCP server that lets large language models (LLMs) carry out multi-step web browsing tasks from text instructions. It exposes an interface through which an LLM can drive a browser instance to navigate pages, fill in input fields, click buttons, and extract structured information.
Rapid Initialization Guide
1. Installation Procedure
Install the package, selecting the extra for your preferred backend provider (e.g., OpenAI):

```bash
pip install -e "git+https://github.com/yourusername/browser-use-mcp.git#egg=browser-use-mcp[openai]"
```

To include support for all available integrations:

```bash
pip install -e "git+https://github.com/yourusername/browser-use-mcp.git#egg=browser-use-mcp[all-providers]"
```

Ensure the underlying browser automation tools are available:

```bash
playwright install chromium
```
2. MCP Client Configuration Setup
Integrate the llm-driven-web-agent service endpoint into your primary MCP client configuration file:
```javascript
{
  "mcpServers": {
    "llm-web-agent": {
      "command": "browser-use-mcp",
      "args": ["--model", "gpt-4o"],
      "env": {
        "OPENAI_API_KEY": "your-openai-api-key", // Substitute with your actual key or environment variable path
        "DISPLAY": ":0" // Necessary for environments supporting a graphical display server
      }
    }
  }
}
```
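Replace the placeholder with your actual API key. A static JSON file generally cannot read environment variables by itself; if your client builds this configuration in code, it can read the key from the environment instead (for example, process.env.OPENAI_API_KEY in Node.js).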
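One way to keep the key out of the file is to generate the configuration from a script. The following is a minimal sketch, assuming the same server command, `--model` flag, and environment variables shown in the configuration above; the helper name `build_mcp_config` is hypothetical and not part of the project.

```python
import json
from typing import Mapping


def build_mcp_config(environ: Mapping[str, str], model: str = "gpt-4o") -> dict:
    """Build an mcpServers entry, taking the API key from the given environment.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    api_key = environ.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early instead of writing a config that cannot authenticate.
        raise KeyError("OPENAI_API_KEY is not set")
    return {
        "mcpServers": {
            "llm-web-agent": {
                "command": "browser-use-mcp",
                "args": ["--model", model],
                "env": {"OPENAI_API_KEY": api_key, "DISPLAY": ":0"},
            }
        }
    }


def render_config(environ: Mapping[str, str]) -> str:
    """Serialize the configuration as indented JSON text."""
    return json.dumps(build_mcp_config(environ), indent=2)
```

Calling `render_config(os.environ)` (after `import os`) returns JSON text that can be written to the client configuration file, so the key lives only in the environment of the process that generates it.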
3. Utilizing the Service in an MCP Client
Python Example utilizing mcp-use
```python
import asyncio
import os

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from mcp_use import MCPAgent, MCPClient


async def process_web_interaction():
    # Load secrets from .env file if present
    load_dotenv()

    # Initialize the client instance based on configuration
    client = MCPClient(
        config={
            "mcpServers": {
                "llm-web-agent": {
                    "command": "browser-use-mcp",
                    "args": ["--model", "gpt-4o"],
                    "env": {
                        "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY"),
                        "DISPLAY": ":0",
                    },
                }
            }
        }
    )

    # Select the generative model interface
    llm = ChatOpenAI(model="gpt-4o")

    # Establish the autonomous agent
    agent = MCPAgent(llm=llm, client=client, max_steps=30)

    # Execute the complex instruction set
    query = """
    Initiate navigation to https://github.com, execute a search query for 'browser-use-mcp', and subsequently generate a high-level summary of the project's purpose.
    """
    result = await agent.run(
        query,
        max_steps=30,
    )
    print(f"\nFinal Output: {result}")


# Run the coroutine only when this file is executed as a script
if __name__ == "__main__":
    asyncio.run(process_web_interaction())
```
Configuration for Claude Desktop Environments
- Launch the Claude Desktop application.
- Navigate to the settings panel, typically under 'Settings → Experimental features'.
- Activate the Claude API Beta feature and enable OpenAPI schema exposure.
- Place the following configuration snippet into your application's specific configuration file:
  - macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
  - Windows: `%AppData%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "browser-use": {
      "command": "browser-use-mcp",
      "args": ["--model", "claude-3-opus-20240229"]
    }
  }
}
```

- Start a new conversation in Claude and ask it to perform web tasks.
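If the configuration file already lists other servers, pasting the snippet over it would drop them. The sketch below shows one possible way to merge the entry into an existing file instead. The helper names `merge_server_entry` and `update_config_file` are hypothetical, and the entry mirrors the snippet above.

```python
import json
from pathlib import Path

# Server entry copied from the configuration snippet in this guide.
BROWSER_USE_ENTRY = {
    "command": "browser-use-mcp",
    "args": ["--model", "claude-3-opus-20240229"],
}


def merge_server_entry(config: dict, name: str, entry: dict) -> dict:
    """Return a copy of config with entry stored under mcpServers[name]."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = entry
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path, name: str, entry: dict) -> None:
    """Read the JSON file at path (if it exists), merge the entry, write it back."""
    existing = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    updated = merge_server_entry(existing, name, entry)
    path.write_text(json.dumps(updated, indent=2), encoding="utf-8")
```

On macOS, for example, this could be called as `update_config_file(Path.home() / "Library/Application Support/Claude/claude_desktop_config.json", "browser-use", BROWSER_USE_ENTRY)`. Other keys and server entries already in the file are left untouched.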
Supported Model Integrations
The following Language Model endpoints are compatible with this browser automation service:
| Provider | Required API Key Environment Variable(s) |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Google | GOOGLE_API_KEY |
| Cohere | COHERE_API_KEY |
| Mistral AI | MISTRAL_API_KEY |
| Groq | GROQ_API_KEY |
| Together AI | TOGETHER_API_KEY |
| AWS Bedrock | AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY |
| Fireworks | FIREWORKS_API_KEY |
| Azure OpenAI | AZURE_OPENAI_API_KEY and AZURE_OPENAI_ENDPOINT |
| Vertex AI (Google) | GOOGLE_APPLICATION_CREDENTIALS |
| NVIDIA | NVIDIA_API_KEY |
| AI21 Labs | AI21_API_KEY |
| Databricks | DATABRICKS_HOST and DATABRICKS_TOKEN |
| IBM watsonx.ai | WATSONX_API_KEY |
| xAI | XAI_API_KEY |
| Upstage | UPSTAGE_API_KEY |
| Hugging Face | HUGGINGFACE_API_KEY |
| Ollama (Local) | OLLAMA_BASE_URL |
| Llama.cpp (Local) | LLAMA_CPP_SERVER_URL |
Consult the official LangChain documentation for further integration details: https://python.langchain.com/docs/integrations/chat/
Configuration of credentials can be centralized by creating a .env file in your project root:
```
OPENAI_API_KEY=your_openai_key_here
# Or include the key for any other supported backend
```
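Before launching the server, it can help to confirm that the variables for the chosen provider are actually set. The following is a minimal sketch that encodes a few rows of the provider table. The lowercase provider labels (such as `aws-bedrock`) and the function name `missing_env_vars` are assumptions made for this example, not identifiers the server recognizes.

```python
import os
from typing import Mapping

# A subset of the provider table; extend it with other rows as needed.
# The dictionary keys are illustrative labels, not server-defined names.
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "aws-bedrock": ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"],
    "azure-openai": ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"],
    "ollama": ["OLLAMA_BASE_URL"],
}


def missing_env_vars(provider: str, environ: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables for provider that are unset or empty."""
    required = REQUIRED_ENV_VARS[provider.lower()]
    return [name for name in required if not environ.get(name)]
```

If a `.env` file is used, call `load_dotenv()` from `python-dotenv` first so that its values are visible in `os.environ`. An empty list means every variable listed for that provider is present.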
Diagnostic and Resolution Guide
- Authentication Failures: Verify that the necessary secret key for the selected provider is correctly established within your operating environment variables or the `.env` file.
- Provider Unavailability: Confirm that the specific package corresponding to your chosen model provider has been installed.
- Browser Automation Failures: Execute `playwright install chromium` to ensure the required browser binaries are present.
- Model Specification Errors: If the service rejects the model name, explicitly assign a recognized model identifier using the `--model` flag during server invocation.
- Debugging Verbosity: Activate detailed logging output by including the `--debug` flag when launching the server process.
- Client Setup Mismatch: Double-check that the command string and environment variable mapping in your client configuration precisely match the server's requirements.
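Some of these checks can be scripted. The sketch below only tests whether the two executables named in this guide can be found on `PATH`; it does not prove that the Chromium binaries were downloaded or that credentials are valid. The function names are hypothetical.

```python
import shutil
from typing import Callable, Optional, Sequence

# Executables this guide relies on: the server command and the Playwright CLI.
REQUIRED_COMMANDS = ("browser-use-mcp", "playwright")


def find_missing_commands(
    commands: Sequence[str] = REQUIRED_COMMANDS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the commands that cannot be located on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


def report(missing: Sequence[str]) -> str:
    """Format the result of find_missing_commands as a one-line message."""
    if not missing:
        return "All required commands were found on PATH."
    return "Not found on PATH: " + ", ".join(missing)
```

Running `print(report(find_missing_commands()))` in the same environment the MCP client uses is a quick way to catch a client setup mismatch, since a command installed in one virtual environment may be invisible to a client launched from another.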
Licensing
MIT

WIKIPEDIA: A headless browser is a web browser without a graphical user interface. It allows web pages to be controlled programmatically through a command-line interface or over a network protocol, while rendering and interpreting content as a standard browser does, including CSS styling, JavaScript execution, and Ajax handling, which simpler parsing tools often lack. Modern browsers (Chrome 59+, Firefox 56+) natively support this remote control functionality, superseding older solutions like PhantomJS.
== Primary Applications ==
The core use cases for operating browsers in a headless configuration include:
- Automated quality assurance workflows for contemporary web applications (Web Testing).
- Generating static snapshots (screenshots) of rendered web pages.
- Executing automated test suites targeting client-side JavaScript functionality.
- Orchestrating complex interactions across web interfaces.
=== Secondary Applications ===
Headless environments are also valuable for advanced web data harvesting (scraping). Furthermore, they were identified as a method to help search engines index content reliant on Ajax rendering. Conversely, misuse cases include launching distributed denial-of-service (DDoS) attacks, artificially inflating advertisement visibility metrics, or performing unauthorized automated site manipulation (e.g., credential stuffing). However, empirical traffic analysis suggests that malicious actors do not disproportionately favor headless browsers over standard ones for common attacks.
== Control Frameworks ==
Given that major browser vendors now offer native headless APIs, several unified software interfaces exist to manage this automation layer:
- Selenium WebDriver: Adheres to the W3C WebDriver specification for cross-browser automation.
- Playwright: A cross-browser automation library supporting Chromium, Firefox, and WebKit, originally built for Node.js and also available for Python, Java, and .NET.
- Puppeteer: Primarily focused on automating Chrome or Firefox instances via Node.js.
=== Testing Integration ===
Several established testing frameworks incorporate headless browser capabilities into their apparatus:
- Capybara: Leverages Headless Chrome or WebKit to simulate end-user actions during testing.
- Jasmine: Defaults to Selenium but can be configured to use WebKit or Headless Chrome to run browser tests.
- Cypress: A dedicated framework for front-end testing that supports headless operation.
- QF-Test: A tool for GUI-based automated testing that supports headless browser execution.
=== Alternative Approaches ===
An alternative strategy involves employing libraries that simulate browser APIs without launching a full rendering engine. For instance, Deno natively integrates certain browser APIs. In the Node.js ecosystem, jsdom provides the most comprehensive simulation of HTML parsing, cookie management, XHR requests, and partial JavaScript execution. While these alternatives are often faster, they typically lack full DOM rendering capabilities and exhibit limited support for complex DOM events compared to genuine headless instances.
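The trade-off can be illustrated in miniature with a parse-only tool. The sketch below uses Python's standard-library `html.parser` as a stand-in; it is far simpler than jsdom and executes no JavaScript at all. The `LinkCollector` class and the sample page are made up for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags found in static markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    """Parse html without rendering it and return the links it contains."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# The script below would add a second link in a real browser,
# but a parser that does not execute JavaScript never sees it.
PAGE = """
<html><body>
  <a href="/static">Static link</a>
  <script>
    var a = document.createElement('a');
    a.href = '/dynamic';
    document.body.appendChild(a);
  </script>
</body></html>
"""

print(extract_links(PAGE))  # ['/static']
```

A headless browser loading the same page would also expose the `/dynamic` link, because it runs the script and updates the DOM before the page is inspected.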
