BrowserControl-via-AccessibilityTree
Automates web browsers through structured accessibility-tree snapshots, independent of visual rendering analysis. Supports tasks like site navigation, form filling, information retrieval, and creation of automated functional test suites within a browser context.
Author

SleepyRabbit
Accessibility-Driven Browser Control Protocol (ADBCP)
This Model Context Protocol (MCP) server uses Playwright to control web browser instances. Interaction is driven by structured accessibility snapshots, which removes the need for pixel-based interpretation or computer vision models.
Core Capabilities
- Efficiency Focus: Employs Playwright's native accessibility layer, avoiding computationally expensive raster image processing.
- Cognitive Simplicity: Operates exclusively on structured, machine-readable data; no visual perception component required for core functionality.
- Predictable Execution: Avoids the ambiguity common to screenshot-based approaches by acting on structural data rather than volatile visual presentation.
Practical Applications
- Navigating complex web interfaces and populating extensive digital forms
- Extracting specific data points from semantically organized web content
- Orchestrating intricate browser behaviors for autonomous agents
- Developing robust, structure-aware automated regression testing scripts
Configuration Snippet
```js
{
  "mcpServers": {
    "browser_control": {
      "command": "npx",
      "args": [
        "@playwright/mcp@latest"
      ]
    }
  }
}
```
Integration into Development Environments (VS Code)
To add the BrowserControl-via-AccessibilityTree service to your IDE setup, install it from the terminal with the VS Code CLI:
```bash
# For standard VS Code
code --add-mcp '{"name":"playwright","command":"npx","args":["@playwright/mcp@latest"]}'
```
```bash
# For VS Code Insiders
code-insiders --add-mcp '{"name":"playwright","command":"npx","args":["@playwright/mcp@latest"]}'
```
Once installed, the service is available to your GitHub Copilot agent within VS Code.
Service Runtime Parameters
The ADBCP service supports the following command-line options:
- --browser <engine>: Specifies the target browser. Options include chrome, firefox, webkit, and msedge. Sub-variants for Chromium/Edge are also supported (e.g., chrome-beta, msedge-dev). Default is chrome.
- --caps <feature_set>: A comma-delimited list of enabled capabilities, such as tabs, pdf handling, history access, waiting mechanics, file I/O, or installation helpers. Default enables all features.
- --cdp-endpoint <connection_url>: Remote debugging protocol endpoint for direct connection.
- --executable-path <location>: Explicit file system path to the browser binary.
- --headless: Runs the browser without a visible graphical interface (GUI). Headed mode is the default.
- --port <transport_port>: TCP port designated for Server-Sent Events (SSE) communication.
- --user-data-dir <directory>: Location for persistent browser profile data.
- --vision: Activates a mode relying on pixel data (screenshots) instead of the default ARIA/accessibility tree processing.
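As a rough illustration of how these options combine, the sketch below assembles the args array for an mcpServers entry from a plain options object. The helper name buildServerArgs and its option keys are hypothetical; only the flags themselves come from the list above, and the capability identifiers in the usage line are assumed spellings.

```javascript
// Hypothetical helper: turns an options object into the args array passed to npx.
// Only flags documented above are emitted; the option key names are illustrative.
function buildServerArgs(options = {}) {
  const args = ['@playwright/mcp@latest'];

  // Flags that take a value, listed in a fixed order so the output is stable.
  const valueFlags = [
    ['browser', '--browser'],
    ['caps', '--caps'],
    ['cdpEndpoint', '--cdp-endpoint'],
    ['executablePath', '--executable-path'],
    ['port', '--port'],
    ['userDataDir', '--user-data-dir'],
  ];
  for (const [key, flag] of valueFlags) {
    const value = options[key];
    if (value === undefined || value === null) continue;
    // --caps takes a comma-delimited list, so arrays are joined.
    args.push(flag, Array.isArray(value) ? value.join(',') : String(value));
  }

  // Boolean switches are emitted only when explicitly true.
  if (options.headless === true) args.push('--headless');
  if (options.vision === true) args.push('--vision');

  return args;
}

// Usage: 'tabs' and 'pdf' are assumed capability identifiers.
const entry = {
  command: 'npx',
  args: buildServerArgs({ browser: 'firefox', caps: ['tabs', 'pdf'], headless: true }),
};
console.log(JSON.stringify({ mcpServers: { browser_control: entry } }, null, 2));
```

Keeping the flag table in one place makes it harder for a client configuration to drift from the documented option names.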
Profile Data Persistence Location
The Playwright MCP maintains its session profile at the following locations:
- Windows: %USERPROFILE%\AppData\Local\ms-playwright\mcp-chrome-profile
- macOS: ~/Library/Caches/ms-playwright/mcp-chrome-profile
- Linux: ~/.cache/ms-playwright/mcp-chrome-profile
This directory stores all session-related state; it is safe to purge this directory between distinct automation runs to ensure a clean slate.
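A minimal sketch of resolving that location in a cleanup script is shown below. The function name mcpProfileDir is hypothetical, the paths are the ones listed above, and the platform strings are Node's process.platform values. It only computes the path; deleting the directory is left to the caller.

```javascript
// Pure function, so it can be exercised for any platform without touching the disk.
// `platform` uses Node's process.platform values: 'win32', 'darwin', 'linux'.
function mcpProfileDir(platform, homeDir) {
  switch (platform) {
    case 'win32':
      return `${homeDir}\\AppData\\Local\\ms-playwright\\mcp-chrome-profile`;
    case 'darwin':
      return `${homeDir}/Library/Caches/ms-playwright/mcp-chrome-profile`;
    case 'linux':
      return `${homeDir}/.cache/ms-playwright/mcp-chrome-profile`;
    default:
      throw new Error(`No documented profile location for platform: ${platform}`);
  }
}

// Usage: resolve the directory for the current machine.
const home = process.env.USERPROFILE || process.env.HOME || '';
try {
  console.log(mcpProfileDir(process.platform, home));
} catch (err) {
  console.log(err.message);
}
```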
Headless Operation Configuration (GUI Suppression)
This mode is ideal for background processing or batch task execution.
```js
{
  "mcpServers": {
    "browser_control": {
      "command": "npx",
      "args": [
        "@playwright/mcp@latest",
        "--headless"
      ]
    }
  }
}
```
Headed Mode on Display-less Linux Systems
To run a headed browser on a machine without a display server (or from a worker process with no GUI access), launch the service separately from an environment where DISPLAY is set, and pass --port to enable the SSE transport.
```bash
npx @playwright/mcp@latest --port 8931
```
Then point the MCP client configuration at this external SSE endpoint:
```js
{
  "mcpServers": {
    "browser_control": {
      "url": "http://localhost:8931/sse"
    }
  }
}
```
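Because the port appears in both the launch command and the client URL, it can help to derive both from one value. The sketch below does that; the helper name sseSetup and its host parameter are hypothetical, while the command and URL shapes follow the snippets above.

```javascript
// Hypothetical helper that derives both halves of the setup from one port number.
function sseSetup(port, host = 'localhost') {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid TCP port: ${port}`);
  }
  return {
    // Command used to launch the server externally.
    serverCommand: `npx @playwright/mcp@latest --port ${port}`,
    // Matching entry for the MCP client configuration.
    clientConfig: {
      mcpServers: {
        browser_control: { url: `http://${host}:${port}/sse` },
      },
    },
  };
}

const setup = sseSetup(8931);
console.log(setup.serverCommand);
console.log(JSON.stringify(setup.clientConfig, null, 2));
```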
Operational Modes
The service exposes two primary interaction paradigms:
- Structure Mode (Default): Prioritizes the accessibility snapshot for superior performance and structural reliability.
- Visual Interpretation Mode: Leverages full-page raster images for interaction.
To engage Visual Interpretation Mode, supply the --vision flag during server bootstrap:
```js
{
  "mcpServers": {
    "browser_control": {
      "command": "npx",
      "args": [
        "@playwright/mcp@latest",
        "--vision"
      ]
    }
  }
}
```
Visual Interpretation Mode is best suited for agents designed to map actions onto specific X/Y pixel coordinates derived from the provided visual input.
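The practical difference between the two modes shows up in how a click is addressed: by structural reference in Structure Mode, by pixel coordinates in Visual Interpretation Mode. The sketch below builds the matching tool name and arguments for each. The function buildClickCall, the mode strings, and the ref value are hypothetical; the tool and argument names are the ones defined in this document.

```javascript
// Builds a { name, arguments } pair for a click in whichever mode the server runs.
// `target` is a hypothetical descriptor: { element, ref } in structure mode,
// { element, x, y } in vision mode.
function buildClickCall(mode, target) {
  if (mode === 'structure') {
    if (typeof target.ref !== 'string') {
      throw new TypeError('structure mode needs a ref from the accessibility dump');
    }
    return {
      name: 'structure_click',
      arguments: { element: target.element, ref: target.ref },
    };
  }
  if (mode === 'vision') {
    if (!Number.isFinite(target.x) || !Number.isFinite(target.y)) {
      throw new TypeError('vision mode needs numeric x and y coordinates');
    }
    return {
      name: 'screen_trigger_click',
      arguments: { element: target.element, x: target.x, y: target.y },
    };
  }
  throw new Error(`Unknown mode: ${mode}`);
}

// The ref value 'e5' is made up for illustration.
console.log(buildClickCall('structure', { element: 'Submit button', ref: 'e5' }));
console.log(buildClickCall('vision', { element: 'Submit button', x: 320, y: 240 }));
```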
Direct Programming Interface (Custom Transports)
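```js
import { createServer } from '@playwright/mcp';
import { SSEServerTransport } from '@modelcontextprotocol/sdk/server/sse.js';

// ... initialization logic ...

const server = createServer({
  launchOptions: { headless: true }
});
// `res` is the HTTP response object of the incoming SSE request.
const transport = new SSEServerTransport("/messages", res);
server.connect(transport);
```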
Structure-Based Operational Definitions
- structure_click
  - Rationale: Execute a primary activation event on a webpage element.
  - Arguments:
    - element (string): Descriptive text aiding in element identification (for permission/context).
    - ref (string): The unique identifier referencing the element within the current page structure dump.
- structure_hover
  - Rationale: Position the cursor over a designated element.
  - Arguments:
    - element (string): Contextual label for the element.
    - ref (string): Specific structural identifier.
- structure_move_and_drop
  - Rationale: Initiate a drag action from a source to a destination element.
  - Arguments:
    - startElement (string): Contextual label for the source element.
    - startRef (string): Source element's unique structural identifier.
    - endElement (string): Contextual label for the destination element.
    - endRef (string): Destination element's unique structural identifier.
- structure_input_text
  - Rationale: Inject sequential characters into an interactive field.
  - Arguments:
    - element (string): Contextual label for the input target.
    - ref (string): Specific structural identifier.
    - text (string): The data sequence to input.
    - submit (boolean, optional): If true, simulates final submission (Enter key).
    - slowly (boolean, optional): If true, types character-by-character to trigger per-character event handlers.
- structure_select_option_from_list
  - Rationale: Choose one or more options from a defined selection control (e.g., <select>).
  - Arguments:
    - element (string): Contextual label for the dropdown/selector.
    - ref (string): Specific structural identifier.
    - values (array): List of values intended for selection.
- structure_dump_accessibility
  - Rationale: Generate and retrieve the current document's complete accessibility tree representation (superior to visual captures).
  - Arguments: None
- structure_capture_visual
  - Rationale: Acquire a raster image of the current viewport. This output is not intended for subsequent structural commands.
  - Arguments:
    - raw (boolean, optional): If true, output is uncompressed PNG data; otherwise, defaults to compressed JPEG format.
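To make the ref workflow concrete, the sketch below wraps these operations in MCP tools/call requests, using the JSON-RPC 2.0 envelope that MCP defines for tool invocation. The helper toolCall, the element labels, and the ref values (e12, e27) are made up for illustration; real refs would be read from the most recent structure_dump_accessibility result.

```javascript
// Wraps a tool name and its arguments in a JSON-RPC 2.0 "tools/call" request.
let nextId = 1;
function toolCall(name, args = {}) {
  return {
    jsonrpc: '2.0',
    id: nextId++,
    method: 'tools/call',
    params: { name, arguments: args },
  };
}

// A typical structure-mode sequence: dump, act, dump again, act.
// The second dump matters because submitting the form may change the page,
// which would make refs from the first dump stale.
const steps = [
  toolCall('structure_dump_accessibility'),
  toolCall('structure_input_text', {
    element: 'Search box',
    ref: 'e12', // hypothetical ref taken from the first dump
    text: 'playwright',
    submit: true,
  }),
  toolCall('structure_dump_accessibility'),
  toolCall('structure_click', {
    element: 'First result link',
    ref: 'e27', // hypothetical ref taken from the second dump
  }),
];

for (const step of steps) {
  console.log(JSON.stringify(step));
}
```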
Visual-Coordinate Based Operations
- screen_relocate_cursor
  - Rationale: Move the input pointer to a precise location on the screen.
  - Arguments:
    - element (string): Contextual identifier.
    - x (number): Horizontal coordinate.
    - y (number): Vertical coordinate.
- screen_capture_image
  - Rationale: Generate a full-viewport screenshot.
  - Arguments: None
- screen_trigger_click
  - Rationale: Initiate a primary mouse click at specified coordinates.
  - Arguments:
    - element (string): Contextual identifier.
    - x (number): Horizontal position for the click event.
    - y (number): Vertical position for the click event.
- screen_perform_drag
  - Rationale: Simulate the holding and releasing of the primary mouse button between two points.
  - Arguments:
    - element (string): Contextual identifier.
    - startX (number): Origin X position.
    - startY (number): Origin Y position.
    - endX (number): Termination X position.
    - endY (number): Termination Y position.
- screen_inject_text
  - Rationale: Feed keyboard input sequence to the focused context.
  - Arguments:
    - text (string): The sequence of characters to input.
    - submit (boolean, optional): Execute 'Enter' upon completion.
- screen_simulate_key_event
  - Rationale: Generate a dedicated key press event.
  - Arguments:
    - key (string): The name of the required key (e.g., Escape, Enter, or a literal character like k).
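Coordinate-based operations need a single point per target, so an agent that has located two regions in a screenshot typically aims at their centers. The sketch below turns two bounding boxes into screen_perform_drag arguments. The box shape ({ x, y, width, height }), the helper names, and the sample values are assumptions for illustration; how the boxes are obtained is outside the scope of this document.

```javascript
// Returns the integer pixel at the center of a bounding box.
function centerOf(box) {
  return {
    x: Math.round(box.x + box.width / 2),
    y: Math.round(box.y + box.height / 2),
  };
}

// Builds the arguments for screen_perform_drag from a source and a target box.
function buildDragArgs(element, fromBox, toBox) {
  const start = centerOf(fromBox);
  const end = centerOf(toBox);
  return { element, startX: start.x, startY: start.y, endX: end.x, endY: end.y };
}

// Hypothetical boxes, e.g. estimated from a screen_capture_image result.
const dragArgs = buildDragArgs(
  'Card "Write docs" onto the Done column',
  { x: 40, y: 100, width: 200, height: 60 },
  { x: 600, y: 80, width: 240, height: 400 }
);
console.log(dragArgs);
// startX: 140, startY: 130, endX: 720, endY: 280
```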
Tab Organization Control
- tab_enumerate
  - Rationale: Retrieve a list of all currently active browser tabs.
  - Arguments: None
- tab_initiate_new
  - Rationale: Open a fresh browsing context.
  - Arguments:
    - url (string, optional): Initial destination address. Defaults to an empty page if omitted.
- tab_activate_by_index
  - Rationale: Switch focus to a tab based on its sequential position.
  - Arguments:
    - index (number): The zero-based index of the target tab.
- tab_terminate
  - Rationale: Close a specific tab context.
  - Arguments:
    - index (number, optional): Index of the tab to dispose of. Defaults to the currently active tab if unspecified.
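Since tabs are addressed by position, closing one presumably shifts the index of every tab after it. The sketch below plans a "close everything except one tab" sequence that stays valid under that assumption by closing from the highest index down. The tab list shape ({ url }) and the helper name closeOthersPlan are hypothetical; the output format of tab_enumerate is not specified here.

```javascript
// Plans the calls needed to close every tab except the one showing `keepUrl`.
// `tabs` is a hypothetical parsed result of tab_enumerate, in index order.
function closeOthersPlan(tabs, keepUrl) {
  const keep = tabs.findIndex((tab) => tab.url === keepUrl);
  if (keep === -1) throw new Error(`No open tab with URL ${keepUrl}`);

  const calls = [];
  // Walk from the last tab down so that lower indices stay valid after each close.
  for (let i = tabs.length - 1; i >= 0; i--) {
    if (i !== keep) calls.push({ name: 'tab_terminate', arguments: { index: i } });
  }
  // Once the others are gone, the kept tab is the only one left, at index 0.
  calls.push({ name: 'tab_activate_by_index', arguments: { index: 0 } });
  return calls;
}

const plan = closeOthersPlan(
  [
    { url: 'https://example.com/a' },
    { url: 'https://example.com/b' },
    { url: 'https://example.com/c' },
  ],
  'https://example.com/b'
);
console.log(plan.map((call) => `${call.name}(${call.arguments.index})`).join(' -> '));
// tab_terminate(2) -> tab_terminate(0) -> tab_activate_by_index(0)
```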
Context Shifting
- context_go_to_address
  - Rationale: Load a specified Uniform Resource Locator (URL).
  - Arguments:
    - url (string): The absolute destination address.
- context_revert_previous
  - Rationale: Navigate backward in the history stack.
  - Arguments: None
- context_advance_next
  - Rationale: Navigate forward in the history stack.
  - Arguments: None
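Because context_go_to_address expects an absolute address, a client may want to reject relative or malformed input before sending the call. The sketch below uses the standard URL constructor for that check. The helper name buildNavigationCall is hypothetical, and restricting schemes to http and https is a policy choice of this example, not a documented server rule.

```javascript
// Validates that `rawUrl` is absolute, then builds the navigation call.
function buildNavigationCall(rawUrl) {
  let parsed;
  try {
    parsed = new URL(rawUrl); // throws on relative or malformed input
  } catch {
    throw new TypeError(`Not an absolute URL: ${rawUrl}`);
  }
  // Assumption of this sketch: only web schemes are allowed.
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new TypeError(`Unsupported scheme: ${parsed.protocol}`);
  }
  return { name: 'context_go_to_address', arguments: { url: parsed.href } };
}

console.log(buildNavigationCall('https://example.com/docs'));

try {
  buildNavigationCall('/login'); // relative path, rejected
} catch (err) {
  console.log(err.message);
}
```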
Input Subsystem
- input_simulate_key_press
  - Rationale: Register a single key press event on the keyboard.
  - Arguments:
    - key (string): The identifier for the key to be activated (e.g., F5, or character p).
Diagnostics
- diagnostics_fetch_console_output
  - Rationale: Collect all accumulated messages logged to the browser's console.
  - Arguments: None
Data Transfer and Export
- transfer_upload_local_files
  - Rationale: Simulate the user selection of files for submission.
  - Arguments:
    - paths (array): Absolute file system paths of the resources to be uploaded.
- transfer_export_as_pdf
  - Rationale: Render the current document content into a Portable Document Format (PDF) file.
  - Arguments: None
Auxiliary Functions
- aux_pause_execution
  - Rationale: Introduce a mandatory temporal delay in the automation sequence.
  - Arguments:
    - time (number): Duration to pause, measured in seconds (capped for safety at 10s).
- aux_modify_viewport_dimensions
  - Rationale: Adjust the visible rendering size of the browser window.
  - Arguments:
    - width (number): Target horizontal pixel dimension.
    - height (number): Target vertical pixel dimension.
- aux_manage_browser_prompts
  - Rationale: Programmatically respond to system-level browser interruptions (alerts, confirmations, input requests).
  - Arguments:
    - accept (boolean): Determines acceptance or dismissal of the prompt.
    - promptText (string, optional): Content to supply if the dialog type requires text input.
- aux_terminate_page_context
  - Rationale: Shut down the current document view.
  - Arguments: None
- aux_ensure_browser_availability
  - Rationale: Verify and install necessary browser components if they are missing based on initial configuration checks.
  - Arguments: None
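The 10-second cap on aux_pause_execution means a longer wait has to be split across several calls. A minimal sketch of that splitting follows. The helper name planPauses is hypothetical, and it assumes the cap applies per call, so that consecutive calls add up to the desired total.

```javascript
const MAX_PAUSE_SECONDS = 10; // per-call cap stated for aux_pause_execution

// Splits a total wait into a series of calls that each respect the cap.
function planPauses(totalSeconds) {
  if (!Number.isFinite(totalSeconds) || totalSeconds < 0) {
    throw new RangeError(`Invalid pause duration: ${totalSeconds}`);
  }
  const calls = [];
  let remaining = totalSeconds;
  while (remaining > 0) {
    const time = Math.min(remaining, MAX_PAUSE_SECONDS);
    calls.push({ name: 'aux_pause_execution', arguments: { time } });
    remaining -= time;
  }
  return calls;
}

console.log(planPauses(25).map((call) => call.arguments.time));
// [ 10, 10, 5 ]
```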
WIKIPEDIA: A headless browser is a web browser without a graphical user interface. Headless browsers offer automated manipulation of web pages within an environment mirroring standard user agents, but they execute strictly via command-line interfaces or network protocols. They are profoundly valuable for functional validation, as they accurately interpret and render HTML content, including CSS styling (layout, color, typography) and execute JavaScript/AJAX routines, capabilities often inaccessible via non-browser testing modalities. Since Chrome v59 and Firefox v56, native remote control APIs have been standardized, largely superseding older solutions like PhantomJS.
== Primary Applications ==
The principal domains utilizing headless browsers include:
- Automated functional validation for contemporary web frameworks (web testing).
- Generating static visual representations (screenshots) of dynamic pages.
- Executing automated tests for complex JavaScript libraries.
- Programmatic control over web page interactions.
=== Secondary Uses ===
Headless agents are also employed in web data aggregation (scraping). Google acknowledged in 2009 that headless agents could improve search engine indexing for sites heavily reliant on Ajax. Conversely, misuse has been documented, such as generating artificial traffic (DDoS), inflating advertising metrics, or automating site manipulation beyond intended scope (e.g., credential testing). However, empirical analysis from 2018 suggests no inherent preference among malicious actors for headless agents over standard browsers for executing attacks like SQL injection or XSS.
== Implementation Landscape ==
Given that major browser engines now incorporate native headless modes via standardized interfaces, several software solutions consolidate this automation layer:
- Selenium WebDriver – A W3C-compliant implementation of the WebDriver protocol.
- Playwright – A Node.js utility designed for unifying control over Chromium, Firefox, and WebKit.
- Puppeteer – A Node.js library focused on controlling Chrome or Firefox instances.
=== Test Automation Integration ===
Many testing harnesses integrate headless browsers into their execution apparatus.
- Capybara utilizes headless browsing (via WebKit or Headless Chrome) to simulate user actions within its tests.
- Jasmine defaults to Selenium but permits configuration for WebKit or Headless Chrome for environmental testing.
- Cypress, a dedicated frontend testing framework.
- QF-Test, a tool for GUI-based automated software validation, which supports headless execution.
=== Alternative Approaches ===
An alternative methodology involves utilizing environments that expose browser-like APIs without full rendering. Deno integrates several browser APIs directly. For Node.js environments, jsdom offers the most comprehensive simulation. These alternatives often support core features (HTML parsing, cookies, XHR, limited JS), but they generally do not render the DOM and have restricted event model support. As a result, they usually run faster than a full browser.
