Web Agent Automation Framework (Puppeteer Backed)

This Model Context Protocol (MCP) server provides browser automation through Puppeteer. It can control a newly launched browser instance or attach to a Chrome instance that is already running.

Project Lineage

This module is an experimental implementation inspired by @modelcontextprotocol/server-puppeteer. It pursues similar goals but explores different approaches to browser automation with MCP.

Core Functionality

- Navigate to URLs.
- Take screenshots of the whole viewport or of specific DOM elements.
- Click elements on the page.
- Fill in form fields with text.
- Choose options in <select> elements.
- Hover over elements.
- Execute arbitrary JavaScript in the browser context.
- Chrome tab management:
  - Connect to an already running Chrome session.
  - Preserve existing Chrome instances instead of closing them.
  - Handle connections carefully: skip extension pages and report clear diagnostics on failure.
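The connection features above can be pictured with the following minimal TypeScript sketch. It is not the project's code: `BrowserLike`, `ConnectionDeps`, `getBrowser`, `connectActiveTab`, and `release` are hypothetical names, and the browser operations are injected so the logic runs without Chrome. It shows one way to reuse a browser across tool calls, attach to an existing session with a helpful error, and release an attached browser by disconnecting instead of closing it (Puppeteer's `Browser` does offer both `disconnect()` and `close()`).

```typescript
// Hypothetical minimal view of a browser handle; `attached` is an assumption for this sketch.
interface BrowserLike {
  connected: boolean;
  attached: boolean; // true when we attached to a user's Chrome instead of launching one
  close(): Promise<void>;
  disconnect(): Promise<void>;
}

// Injected so the strategy can be exercised without a real Chrome (hypothetical names).
interface ConnectionDeps {
  launchNew: () => Promise<BrowserLike>;
  connectToDebugPort: (port: number) => Promise<BrowserLike>;
}

let current: BrowserLike | null = null;

// Default mode: reuse the browser across tool calls, launching one only when needed.
async function getBrowser(deps: ConnectionDeps): Promise<BrowserLike> {
  if (current && current.connected) {
    return current;
  }
  current = await deps.launchNew();
  return current;
}

// Existing-session mode: attach to Chrome's debugging port or fail with a useful message.
async function connectActiveTab(deps: ConnectionDeps, debugPort = 9222): Promise<BrowserLike> {
  try {
    current = await deps.connectToDebugPort(debugPort);
    return current;
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    throw new Error(
      `Could not connect to Chrome on port ${debugPort} (${reason}). ` +
        `Was Chrome started with --remote-debugging-port=${debugPort}?`,
    );
  }
}

// Release without killing the user's browser: disconnect if attached, close only if we launched it.
async function release(browser: BrowserLike): Promise<void> {
  if (browser.attached) {
    await browser.disconnect();
  } else {
    await browser.close();
  }
  if (current === browser) {
    current = null;
  }
}
```

Keeping `disconnect()` and `close()` on separate paths is what lets an attached session keep running after the server is done with it.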

Structural Organization

```
/ (root directory)
├── src/
│   ├── config/       # Setup and configuration modules
│   ├── tools/        # Tool definitions and logic
│   ├── browser/      # Browser connection utilities
│   ├── types/        # TypeScript interface definitions
│   ├── resources/    # Resource handlers
│   └── server.ts     # Main server initialization
├── index.ts          # Entry point
└── README.md         # Primary documentation
```

Deployment Instructions

Method 1: Installation via npm Registry

Execute the following command for global installation:

```bash
npm install -g puppeteer-mcp-server
```

Alternatively, run it with npx without installing it globally:

```bash
npx puppeteer-mcp-server
```

Method 2: Building from Source Code

Clone or download the repository.
Install the dependencies:

```bash
npm install
```

Build the project:

```bash
npm run build
```

Start the server:

```bash
npm start
```

Configuring the MCP Gateway for Claude

To use the server with Claude, register it in Claude's MCP configuration file.

For the Claude Desktop Application

Add one of the following blocks to your Claude Desktop configuration file. Its location depends on the operating system: `%APPDATA%\Claude\claude_desktop_config.json` on Windows, `~/Library/Application Support/Claude/claude_desktop_config.json` on macOS.

Global npm Install Scenario:

```json
{
  "mcpServers": {
    "puppeteer": {
      "command": "puppeteer-mcp-server",
      "args": [],
      "env": {}
    }
  }
}
```

npx Execution Scenario:

```json
{
  "mcpServers": {
    "puppeteer": {
      "command": "npx",
      "args": ["-y", "puppeteer-mcp-server"],
      "env": {}
    }
  }
}
```

Source Build Scenario:

```json
{
  "mcpServers": {
    "puppeteer": {
      "command": "node",
      "args": ["path/to/puppeteer-mcp-server/dist/index.js"],
      "env": {
        "NODE_OPTIONS": "--experimental-modules"
      }
    }
  }
}
```

(The same blocks work for the Claude VS Code extension; add them to that extension's settings file instead.)

Note for Source Install: Substitute path/to/puppeteer-mcp-server with the absolute filesystem location of your cloned repository.
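As a sketch of how the three scenarios differ, the hypothetical `buildPuppeteerEntry` helper below produces the `puppeteer` entry for each install method and rejects relative paths for source installs. It only mirrors the JSON blocks above; the helper and its absolute-path check are assumptions, not part of the project.

```typescript
type InstallMode = "global" | "npx" | "source";

interface McpServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

// POSIX absolute paths ("/home/...") or Windows drive paths ("C:\..." or "C:/...").
const ABSOLUTE_PATH = /^(\/|[A-Za-z]:[\\/])/;

// Hypothetical helper returning the "puppeteer" entry for each install scenario.
function buildPuppeteerEntry(mode: InstallMode, repoPath?: string): McpServerEntry {
  switch (mode) {
    case "global":
      return { command: "puppeteer-mcp-server", args: [], env: {} };
    case "npx":
      return { command: "npx", args: ["-y", "puppeteer-mcp-server"], env: {} };
    case "source": {
      if (!repoPath || !ABSOLUTE_PATH.test(repoPath)) {
        throw new Error("A source install needs the absolute path of the cloned repository");
      }
      const base = repoPath.replace(/[\\/]+$/, ""); // drop trailing separators
      return {
        command: "node",
        args: [`${base}/dist/index.js`],
        env: { NODE_OPTIONS: "--experimental-modules" },
      };
    }
    default:
      throw new Error(`Unknown install mode: ${String(mode)}`);
  }
}

// Print a complete configuration for the npx scenario.
const config = { mcpServers: { puppeteer: buildPuppeteerEntry("npx") } };
console.log(JSON.stringify(config, null, 2));
```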

Operational Modes

Default Operation

The server initiates a fresh, controlled browser instance automatically upon startup.

Existing Session Connection Mode

To interface with a running Chrome instance:

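Close Chrome completely so that no Chrome processes are left running. If Chrome is already running, launching it again only opens a window in the existing process, and the debugging flag is ignored.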
Launch Chrome with remote debugging enabled on port 9222:

```bash
# Windows
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222

# macOS
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222

# Linux
google-chrome --remote-debugging-port=9222
```
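Note: starting with Chrome 136, `--remote-debugging-port` is ignored when Chrome uses its default profile directory. If the port does not open, also pass `--user-data-dir=<path>` pointing to a separate directory (that profile starts without your usual logins and extensions).

In that Chrome window, navigate to the page you want to automate.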
Call the `puppeteer_connect_active_tab` tool with these parameters:

```json
{
  "targetUrl": "https://example.com",
  "debugPort": 9222
}
```

Both parameters are optional: `targetUrl` is the URL of the tab to attach to, and `debugPort` is the remote debugging port (default: 9222).

The server then:
- Finds and attaches to the Chrome instance exposing the remote debugging protocol on that port.
- Leaves the user's browser session running instead of closing it.
- Selects a regular web tab, skipping internal Chrome extension pages.
- Reports informative diagnostics if the connection fails.
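To illustrate the tab-selection step, here is a minimal sketch that picks a target from the list Chrome's DevTools HTTP endpoint returns at `http://127.0.0.1:<port>/json/list`. The `DevToolsTarget` interface lists only a few of the fields Chrome reports, and `pickTargetTab` with its URL-prefix matching rule is hypothetical. In Puppeteer, the attach step itself is typically `puppeteer.connect({ browserURL: "http://127.0.0.1:9222" })`.

```typescript
// A subset of the fields in each entry returned by Chrome's /json/list endpoint.
interface DevToolsTarget {
  id: string;
  type: string; // e.g. "page", "background_page", "service_worker"
  url: string;
  webSocketDebuggerUrl?: string;
}

// Hypothetical selection rule: regular pages only, optionally matched by URL prefix.
function pickTargetTab(targets: DevToolsTarget[], targetUrl?: string): DevToolsTarget {
  // Keep ordinary tabs; extension pages can also report type "page", so filter by scheme too.
  const pages = targets.filter(
    (t) => t.type === "page" && !t.url.startsWith("chrome-extension://"),
  );
  if (pages.length === 0) {
    throw new Error("No regular web tabs found; open a page in the debug-enabled Chrome window first");
  }
  if (targetUrl) {
    const match = pages.find((t) => t.url.startsWith(targetUrl));
    if (!match) {
      const open = pages.map((t) => t.url).join(", ");
      throw new Error(`No open tab matches ${targetUrl}; open tabs: ${open}`);
    }
    return match;
  }
  return pages[0]; // no preference given: use the first regular tab
}
```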

Registered Toolset

`puppeteer_connect_active_tab`

Connects to a Chrome instance that was started with remote debugging enabled.
- Optional parameters:
  * targetUrl: URL of the tab to connect to.
  * debugPort: TCP port of Chrome's remote debugging interface (default: 9222).

`puppeteer_navigate`

Navigates the browser to a new URL.
- Required parameter:
  * url: The full URL to load.

`puppeteer_screenshot`

Captures a screenshot of the current page or of a specific element.
- Required parameter:
  * name: Name for the resulting screenshot.
- Optional parameters:
  * selector: CSS selector of the element to capture.
  * width: Viewport width for the capture (default: 800 pixels).
  * height: Viewport height for the capture (default: 600 pixels).
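A minimal sketch of how the documented screenshot defaults might be applied before capturing. `ScreenshotArgs`, `resolveScreenshotArgs`, and the positive-integer check are assumptions; a real handler would then call Puppeteer's `page.setViewport({ width, height })` and either `page.screenshot()` or, when a selector is given, the matching element's `screenshot()` method.

```typescript
// Hypothetical argument shape for puppeteer_screenshot.
interface ScreenshotArgs {
  name: string;
  selector?: string;
  width?: number;
  height?: number;
}

interface ResolvedScreenshot {
  name: string;
  selector?: string;
  width: number;
  height: number;
}

// Reject zero, negative, or fractional viewport sizes (an assumed validation rule).
function requirePositiveInt(label: string, value: number): number {
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${label} must be a positive integer, got ${value}`);
  }
  return value;
}

function resolveScreenshotArgs(args: ScreenshotArgs): ResolvedScreenshot {
  if (!args.name || args.name.trim() === "") {
    throw new Error("puppeteer_screenshot requires a non-empty 'name'");
  }
  return {
    name: args.name,
    selector: args.selector,
    width: requirePositiveInt("width", args.width ?? 800), // documented default
    height: requirePositiveInt("height", args.height ?? 600), // documented default
  };
}
```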

`puppeteer_click`

Clicks an element with the primary mouse button.
- Required parameter:
  * selector: CSS selector of the element to click.

`puppeteer_fill`

Enters text into a form field such as an <input>.
- Required parameters:
  * selector: CSS selector of the input field.
  * value: The text to enter.

`puppeteer_select`

Selects an option in a dropdown list.
- Required parameters:
  * selector: CSS selector of the <select> element.
  * value: The value attribute of the option to select.

`puppeteer_hover`

Hovers the mouse over an element.
- Required parameter:
  * selector: CSS selector of the element to hover over.

`puppeteer_evaluate`

Executes arbitrary JavaScript in the context of the current page.
- Required parameter:
  * script: The JavaScript code to run.
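The toolset above could be declared in the shape MCP uses for tool listings (`name`, `description`, and a JSON Schema `inputSchema`). The sketch below is hypothetical: the descriptions are paraphrased from this README, and `missingArgs` is an illustrative helper for checking required parameters, not the server's API.

```typescript
// Tool declaration in the MCP tool-listing shape.
interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, { type: string; description: string }>;
    required?: string[];
  };
}

const str = (description: string) => ({ type: "string", description });
const num = (description: string) => ({ type: "number", description });
const sel = str("CSS selector of the target element");

const TOOLS: ToolDefinition[] = [
  {
    name: "puppeteer_connect_active_tab",
    description: "Connect to a Chrome instance started with remote debugging",
    inputSchema: {
      type: "object",
      properties: { targetUrl: str("URL of the tab to attach to"), debugPort: num("Debugging port (default 9222)") },
    },
  },
  {
    name: "puppeteer_navigate",
    description: "Navigate to a URL",
    inputSchema: { type: "object", properties: { url: str("URL to load") }, required: ["url"] },
  },
  {
    name: "puppeteer_screenshot",
    description: "Screenshot the page or an element",
    inputSchema: {
      type: "object",
      properties: {
        name: str("Name for the screenshot"),
        selector: str("CSS selector of the element to capture"),
        width: num("Viewport width in pixels (default 800)"),
        height: num("Viewport height in pixels (default 600)"),
      },
      required: ["name"],
    },
  },
  {
    name: "puppeteer_click",
    description: "Click an element",
    inputSchema: { type: "object", properties: { selector: sel }, required: ["selector"] },
  },
  {
    name: "puppeteer_fill",
    description: "Enter text into an input field",
    inputSchema: { type: "object", properties: { selector: sel, value: str("Text to enter") }, required: ["selector", "value"] },
  },
  {
    name: "puppeteer_select",
    description: "Choose an option in a <select> element",
    inputSchema: { type: "object", properties: { selector: sel, value: str("Option value") }, required: ["selector", "value"] },
  },
  {
    name: "puppeteer_hover",
    description: "Hover over an element",
    inputSchema: { type: "object", properties: { selector: sel }, required: ["selector"] },
  },
  {
    name: "puppeteer_evaluate",
    description: "Run JavaScript in the page",
    inputSchema: { type: "object", properties: { script: str("JavaScript code to run") }, required: ["script"] },
  },
];

// Hypothetical helper: list the required arguments missing from a tool call.
function missingArgs(toolName: string, args: Record<string, unknown>): string[] {
  const tool = TOOLS.find((t) => t.name === toolName);
  if (!tool) {
    throw new Error(`Unknown tool: ${toolName}`);
  }
  return (tool.inputSchema.required ?? []).filter((key) => args[key] === undefined);
}
```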

Security Protocols

When using the remote debugging interface:
* Restrict access to trusted local or internal networks.
* Use a unique, non-default port for debugging.
* Close the debugging port as soon as you no longer need it.
* Never expose the debugging endpoint to the public internet.
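As one defensive measure, a client could refuse to attach to a debugging endpoint that is not on the local machine. The `assertLocalDebugEndpoint` guard below is a hypothetical sketch and is stricter than the guidelines above, since it rejects internal-network hosts as well as public ones.

```typescript
// Hypothetical guard: accept only loopback hosts for the DevTools endpoint.
function assertLocalDebugEndpoint(endpoint: string): URL {
  const url = new URL(endpoint); // throws on malformed input
  const host = url.hostname; // lowercased; IPv6 hosts keep their brackets, e.g. "[::1]"
  const isLoopback =
    host === "localhost" || host === "[::1]" || /^127(\.\d{1,3}){3}$/.test(host);
  if (!isLoopback) {
    throw new Error(`Refusing to use remote debugging endpoint on non-loopback host ${host}`);
  }
  return url;
}
```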

Operational Monitoring and Debugging Insights

Persistent File Logging

The server uses the Winston library for structured logging:

- Location: `logs/` directory.
- File name pattern: `mcp-puppeteer-YYYY-MM-DD.log`.
- Rotation policy:
  - Files rotate daily.
  - Maximum size: 20 MB per file.
  - Retention: 14 days.
  - Older log files are compressed automatically.
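The naming and retention rules above can be expressed as small helpers. This is a sketch under assumptions: `logFileName` and `expiredLogs` are hypothetical, dates use local time, compressed files are assumed to end in `.log.gz`, and the retention window counts today as the first of the 14 days. In the actual server, the Winston setup presumably handles rotation, size limits, and compression; this code only shows the date arithmetic.

```typescript
const LOG_PREFIX = "mcp-puppeteer-";
const LOG_NAME = /^mcp-puppeteer-(\d{4})-(\d{2})-(\d{2})\.log(\.gz)?$/;

// Format a date as YYYY-MM-DD in local time.
function formatDate(d: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  return `${d.getFullYear()}-${pad(d.getMonth() + 1)}-${pad(d.getDate())}`;
}

function logFileName(d: Date): string {
  return `${LOG_PREFIX}${formatDate(d)}.log`;
}

// Return log files dated before the retention window (today counts as day 1).
function expiredLogs(fileNames: string[], now: Date, retentionDays = 14): string[] {
  const cutoff = new Date(now.getFullYear(), now.getMonth(), now.getDate() - (retentionDays - 1));
  return fileNames.filter((fileName) => {
    const m = LOG_NAME.exec(fileName);
    if (!m) {
      return false; // not a server log file; leave it alone
    }
    const fileDay = new Date(Number(m[1]), Number(m[2]) - 1, Number(m[3]));
    return fileDay.getTime() < cutoff.getTime();
  });
}
```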

Severity Levels

- DEBUG: Highly granular tracing information for deep diagnostics.
- INFO: Standard updates on system status and key milestones.
- WARN: Notifications about potential issues that do not halt execution.
- ERROR: Records of failures, exceptions, and operational breakdowns.
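Winston's default npm levels assign lower numbers to more severe levels (`error` 0, `warn` 1, `info` 2, `debug` 5), and a logger emits a message only when its level number is at or below the configured threshold. The hypothetical helpers below restate that rule for the four levels this README uses.

```typescript
// Priorities from Winston's default npm levels; http, verbose, and silly are omitted here.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 5 } as const;
type Level = keyof typeof LEVELS;

// A message is emitted when it is at least as severe as the threshold.
function isEnabled(messageLevel: Level, threshold: Level): boolean {
  return LEVELS[messageLevel] <= LEVELS[threshold];
}

// All levels that would be written for a given threshold.
function enabledLevels(threshold: Level): Level[] {
  return (Object.keys(LEVELS) as Level[]).filter((lvl) => isEnabled(lvl, threshold));
}
```

With the threshold set to `info`, `enabledLevels` returns `["error", "warn", "info"]`, so DEBUG output is suppressed.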

Logged Data Scope

- System launch and termination events.
- Browser lifecycle actions (launch, connection, closure).
- Status reports on URL navigation requests and outcomes.
- Execution status and return values of invoked tools.
- Comprehensive error reports, including full stack traces.
- Output captured from the browser's console.
- Metadata about generated artifacts (screenshots, captured console logs).

Exception Management Strategy

The server returns explicit diagnostics for common problems, including:
* Failure to connect to the browser.
* No element matches the given selector.
* Invalid or syntactically incorrect CSS/XPath selectors.
* Runtime errors during JavaScript execution.
* Failures while capturing screenshots.

Every tool call returns a structured response containing:
- A boolean indicating success or failure.
- A message describing the cause of a failure, if any.
- The resulting data on success.
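A minimal sketch of such an envelope, assuming the hypothetical `ToolResult` type and `runTool` wrapper below: errors thrown by a handler become a failure result instead of crashing the server. Note that on the wire, an MCP tool result is expressed through a `content` array and an optional `isError` flag, so a real server would translate this envelope into that form.

```typescript
// Hypothetical result envelope with the three fields described above.
interface ToolResult<T> {
  success: boolean;
  message?: string; // failure reason, present only on failure
  data?: T; // payload, present only on success
}

// Run a tool handler and turn any thrown error into a failure result.
async function runTool<T>(handler: () => Promise<T>): Promise<ToolResult<T>> {
  try {
    const data = await handler();
    return { success: true, data };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { success: false, message };
  }
}
```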

All errors are also written to the log files, together with:
- The time they occurred.
- The error message.
- The stack trace, where available.
- Relevant operational context.
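The log enrichment could be sketched as a function that turns any thrown value into a record with the four listed fields. `ErrorLogEntry` and `toErrorLogEntry` are hypothetical names; with Winston, `format.errors({ stack: true })` is the built-in way to carry stack traces into log output.

```typescript
// Hypothetical shape of an enriched error log entry.
interface ErrorLogEntry {
  timestamp: string; // ISO 8601, UTC
  level: "error";
  message: string;
  stack?: string; // only available when an Error object was thrown
  context: Record<string, string>;
}

function toErrorLogEntry(err: unknown, context: Record<string, string>, now: Date = new Date()): ErrorLogEntry {
  const isError = err instanceof Error;
  return {
    timestamp: now.toISOString(),
    level: "error",
    message: isError ? err.message : String(err),
    stack: isError ? err.stack : undefined,
    context,
  };
}
```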

Contribution Guidelines

We welcome contributions! See CONTRIBUTING.md for how to submit code changes, report bugs, and take part in development.

Licensing

This software is distributed under the terms defined in the LICENSE file.

WIKIPEDIA: A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but they are executed via a command-line interface or using network communication. They are particularly useful for testing web pages as they are able to render and understand HTML the same way a browser would, including styling elements such as page layout, color, font selection and execution of JavaScript and Ajax which are usually not available when using other testing methods. Since version 59 of Google Chrome and version 56 of Firefox, there is native support for remote control of the browser. This made earlier efforts obsolete, notably PhantomJS.

== Use cases ==

The primary applications for headless browsers include:

- Automated testing of modern web applications (web testing).
- Taking screenshots of rendered web pages.
- Running automated tests for JavaScript libraries and frameworks.
- Programmatic interaction with web pages.

=== Auxiliary Applications ===

Headless browsers are also useful for web scraping. Google stated in 2009 that using a headless browser could help its search engine index content from websites that load content with Ajax. Headless browsers have also been misused, for example for:

- Launching distributed denial-of-service (DDoS) attacks against websites.
- Inflating advertisement impression counts.
- Automating websites in ways their owners did not intend (for example, credential stuffing).

However, a 2018 analysis of web traffic found that attackers do not significantly prefer headless browsers over regular browsers for attacks such as DDoS, SQL injection, or cross-site scripting.

== Execution Methods ==

Because several major browsers support headless mode natively through their own APIs, some software offers a unified interface for controlling them. Prominent examples include:

- Selenium WebDriver: implements the W3C WebDriver specification.
- Playwright: a Node.js library for automating Chromium, Firefox, and WebKit.
- Puppeteer: a Node.js library for controlling Chrome or Firefox.

=== Automated Testing Integration ===

Many established testing tools use headless browsers as part of their testing setup. Examples include:

- Capybara: uses either WebKit or Headless Chrome to simulate user behavior in its test scenarios.
- Jasmine: uses Selenium by default but can use WebKit or Headless Chrome to run browser tests.
- Cypress: a framework for end-to-end frontend testing.
- QF-Test: a commercial tool for automated GUI testing that can also run headless browsers.

=== Alternative Approaches ===

An alternative approach is to use software that provides browser APIs directly in the host runtime. For example, Deno provides browser APIs as part of its design, and in the Node.js ecosystem, jsdom is the most complete DOM implementation. These alternatives usually support basic browser features (HTML parsing, cookies, XHR, some JavaScript), but they do not render the DOM visually and have limited support for DOM events. As a result, they usually run faster than full headless browsers.

web-agent-automation-framework

Author: merajmehrabi
