Steel Web Orchestrator

https://github.com/user-attachments/assets/25848033-40ea-4fa4-96f9-83b6153a0212

This server implements the Model Context Protocol (MCP) specification, providing access to sophisticated browser control mechanisms powered by Puppeteer and the Steel ecosystem. It allows generative models, such as Claude, to execute complex web tasks by issuing commands for actions like element interaction (clicks, text input), viewport manipulation (scrolling), and capturing the visual state of web pages (screenshots).

Instruct the integrated model to perform tasks such as:
- "Locate a specific cooking instruction set and compile the required material list."
- "Monitor and report the current status of a pending shipment."
- "Perform comparative price analysis across several e-commerce sites for a designated item."
- "Automate the entry of personal details into a multi-step online registration portal."

⚡️ Rapid Setup Guide

This guide walks through starting Steel Voyager within Claude Desktop. Switching the backend between the managed Steel Cloud service and a self-hosted local deployment requires only minor configuration changes.

Prerequisites

Ensure up-to-date installations of Git and Node.js.
Have the Claude Desktop client installed.
(Optional) If targeting a self-hosted setup, initiate the requisite Steel Docker container.
(Optional) If utilizing Steel Cloud, secure and retrieve your authentication token from Steel Developer Settings.

A) Cloud Deployment Start Sequence

Retrieve and build the source code repository:

git clone https://github.com/steel-dev/steel-mcp-server.git
cd steel-mcp-server
npm install
npm run build

Modify the Claude Desktop configuration file (typically located at ~/Library/Application Support/Claude/claude_desktop_config.json) by inserting the following server definition:

{
  "mcpServers": {
    "steel-puppeteer": {
      "command": "node",
      "args": ["path/to/steel-mcp-server/dist/index.js"],
      "env": {
        "STEEL_LOCAL": "false",
        "STEEL_API_KEY": "YOUR_STEEL_API_KEY_HERE",
        "GLOBAL_WAIT_SECONDS": "1"
      }
    }
  }
}

Substitute YOUR_STEEL_API_KEY_HERE with your active Steel credential.
Confirm that the STEEL_LOCAL flag is explicitly set to "false" for cloud operation.
Initiate the Claude Desktop application. It will automatically instantiate this MCP service in its designated Cloud mode.
(Optional) Active browser sessions managed by Steel can be monitored or controlled via your central Steel management portal.

B) Local / Self-Hosted Steel Start Sequence

Confirm that your local or privately hosted Steel backend service is operational (e.g., running via the open-source Steel Docker image).
Fetch and compile the project code (if not previously completed):

git clone https://github.com/steel-dev/steel-mcp-server.git
cd steel-mcp-server
npm install
npm run build

Update the Claude Desktop configuration (~/Library/Application Support/Claude/claude_desktop_config.json) to enable local connectivity:

{
  "mcpServers": {
    "steel-puppeteer": {
      "command": "node",
      "args": ["path/to/steel-mcp-server/dist/index.js"],
      "env": {
        "STEEL_LOCAL": "true",
        "STEEL_BASE_URL": "http://localhost:3000",
        "GLOBAL_WAIT_SECONDS": "1"
      }
    }
  }
}

The STEEL_LOCAL flag must be set to "true".
If operating Steel on a non-default endpoint, adjust STEEL_BASE_URL to reference that endpoint.
Launch Claude Desktop. It will now establish a connection to your local Steel instance and activate Steel Voyager in Local mode.
(Optional) Visual inspection of active sessions locally can be done through your local Steel interface, often accessible at localhost:5173.
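
Both start sequences add a steel-puppeteer entry to the mcpServers object of claude_desktop_config.json. If that file already lists other servers, the new entry should be merged in rather than replacing the whole object. The following TypeScript sketch illustrates one way to do that merge; the interface names and the addMcpServer helper are hypothetical and not part of the project, and the "filesystem" entry is only a stand-in for a pre-existing server.

```typescript
// Hypothetical shapes for illustration; only the key names
// (mcpServers, command, args, env) come from the snippets above.
interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface ClaudeDesktopConfig {
  mcpServers?: Record<string, McpServerEntry>;
  [otherKey: string]: unknown;
}

// Return a copy of `config` with `entry` registered under `serverName`,
// leaving other servers and top-level settings untouched.
function addMcpServer(
  config: ClaudeDesktopConfig,
  serverName: string,
  entry: McpServerEntry,
): ClaudeDesktopConfig {
  return {
    ...config,
    mcpServers: { ...(config.mcpServers ?? {}), [serverName]: entry },
  };
}

// Stand-in for a config file that already defines another server.
const existing: ClaudeDesktopConfig = {
  mcpServers: { filesystem: { command: "npx", args: ["some-other-server"] } },
};

const updated = addMcpServer(existing, "steel-puppeteer", {
  command: "node",
  // Placeholder: the built entry point inside the cloned repository.
  args: ["path/to/steel-mcp-server/dist/index.js"],
  env: {
    STEEL_LOCAL: "true",
    STEEL_BASE_URL: "http://localhost:3000",
    GLOBAL_WAIT_SECONDS: "1",
  },
});
```

Serializing updated with JSON.stringify(updated, null, 2) yields text that can be written back to the configuration file.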

Setup complete! Upon Claude Desktop initialization, the MCP server will manage browser operations in the background, enabling interaction with digital interfaces via Steel Voyager's capabilities.

For supplementary configuration data or diagnostic help, consult the primary MCP setup documentation: https://modelcontextprotocol.io/quickstart/user

🛠️ Exposed Capabilities (Tools)

navigate
Direct the browser agent to a specified Uniform Resource Locator.
Inputs:
- url (string, mandatory): The target web address (e.g., "https://www.example.org").
search
Execute a query via the Google search engine by constructing the appropriate URL.
Inputs:
- query (string, mandatory): The textual phrase to be searched on Google.
click
Initiate an interaction (a press) on a page element identified by its visual label.
Inputs:
- label (number, mandatory): The numeric identifier assigned to the target component.
type
Insert textual content into an interactive form field designated by a label.
Inputs:
- label (number, mandatory): The identifier corresponding to the input control.
- text (string, mandatory): The character sequence to input.
- replaceText (boolean, optional): If true, discards existing text before insertion.
scroll_down
Move the viewport downwards along the page axis.
Inputs:
- pixels (integer, optional): The vertical distance in pixels to traverse. Default scrolls one screen height.
scroll_up
Move the viewport upwards along the page axis.
Inputs:
- pixels (integer, optional): The vertical distance in pixels to traverse upward. Default scrolls one screen height.
go_back
Revert the browser history to the preceding page state.
No inputs required
wait
Introduce a temporal pause, capped at 10 seconds, useful for asynchronous content loading scenarios.
Inputs:
- seconds (number, mandatory): Duration to pause, constrained between 0.0 and 10.0.
save_unmarked_screenshot
Capture the current display state without any graphical annotations (labels or boxes) and register it as an MCP artifact.
Inputs:
- resourceName (string, optional): The designated file identifier for storage. If omitted, a system-generated identifier is used.
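
The input rules listed above can be summarized as a small validation table. The TypeScript sketch below is an illustration only: TOOL_SPECS, validateToolCall, and buildGoogleSearchUrl are hypothetical names, the checks mirror the list above rather than the server's actual schema, and the Google URL format is an assumption about what "constructing the appropriate URL" might look like.

```typescript
type FieldKind = "string" | "number" | "integer" | "boolean";

interface FieldSpec {
  kind: FieldKind;
  required: boolean;
  min?: number;
  max?: number;
}

// Transcribed from the tool list above; not read from the server's schema.
const TOOL_SPECS: Record<string, Record<string, FieldSpec>> = {
  navigate: { url: { kind: "string", required: true } },
  search: { query: { kind: "string", required: true } },
  click: { label: { kind: "number", required: true } },
  type: {
    label: { kind: "number", required: true },
    text: { kind: "string", required: true },
    replaceText: { kind: "boolean", required: false },
  },
  scroll_down: { pixels: { kind: "integer", required: false } },
  scroll_up: { pixels: { kind: "integer", required: false } },
  go_back: {},
  wait: { seconds: { kind: "number", required: true, min: 0, max: 10 } },
  save_unmarked_screenshot: { resourceName: { kind: "string", required: false } },
};

function matchesKind(value: unknown, kind: FieldKind): boolean {
  if (kind === "number" || kind === "integer") {
    if (typeof value !== "number" || !isFinite(value)) return false;
    return kind === "number" || Math.floor(value) === value;
  }
  return typeof value === kind;
}

// Return a list of problems with the call; an empty list means it looks valid.
function validateToolCall(tool: string, args: Record<string, unknown>): string[] {
  const spec = TOOL_SPECS[tool];
  if (spec === undefined) return [`unknown tool: ${tool}`];
  const errors: string[] = [];
  for (const field of Object.keys(spec)) {
    const rule = spec[field];
    const value = args[field];
    if (rule === undefined) continue;
    if (value === undefined) {
      if (rule.required) errors.push(`${field} is required`);
      continue;
    }
    if (!matchesKind(value, rule.kind)) {
      errors.push(`${field} must be a ${rule.kind}`);
      continue;
    }
    if (typeof value === "number") {
      if (rule.min !== undefined && value < rule.min) errors.push(`${field} must be >= ${rule.min}`);
      if (rule.max !== undefined && value > rule.max) errors.push(`${field} must be <= ${rule.max}`);
    }
  }
  return errors;
}

// Assumed URL format for the search tool; the server's exact URL may differ.
function buildGoogleSearchUrl(query: string): string {
  return `https://www.google.com/search?q=${encodeURIComponent(query)}`;
}
```

For instance, validateToolCall("wait", { seconds: 12 }) reports a single problem because the duration exceeds the 10-second cap, while validateToolCall("go_back", {}) reports none.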

Artifacts (Resources)

Captured Visuals (Screenshots): Every saved image is retrievable via an MCP resource identifier formatted as screenshot://RESOURCE_NAME

These images are generated either by explicit invocation of save_unmarked_screenshot or automatically upon completion of most command executions, providing a visual log. Retrieval follows standard MCP resource fetching protocols.
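
As a small illustration of the screenshot://RESOURCE_NAME format, the TypeScript sketch below builds and parses such identifiers. The helper names are hypothetical, and the sketch assumes resource names are used verbatim with no escaping, which the description above does not specify.

```typescript
const SCREENSHOT_SCHEME = "screenshot://";

// Build the resource URI for a saved screenshot name.
function screenshotUri(resourceName: string): string {
  return SCREENSHOT_SCHEME + resourceName;
}

// Extract the resource name from a screenshot URI,
// or return null if the URI uses a different scheme or has no name.
function parseScreenshotUri(uri: string): string | null {
  if (uri.slice(0, SCREENSHOT_SCHEME.length) !== SCREENSHOT_SCHEME) return null;
  const resourceName = uri.slice(SCREENSHOT_SCHEME.length);
  return resourceName.length > 0 ? resourceName : null;
}
```

A client could use parseScreenshotUri to tell screenshot resources apart from other resource URIs before requesting them.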

(Note: Internal system diagnostic logs are collected but are not exposed via resource URIs; they are only viewable within the server's operational logs.)

✨ Core Capabilities

Full browser control utilizing Puppeteer.
Seamless integration with Steel for session lifecycle management.
Precise identification of interactive components via numerical labeling.
High-fidelity visual capture functionality.
Support for core web interactions (navigation, input, clicking).
Enhanced functionality for dynamically loaded content through scroll commands.
Flexible deployment supporting both local execution and remote Steel services.

Visual Element Annotation Protocol

When operating on web interfaces, the Steel Puppeteer agent overlays visual aids to contextualize actionable items:

Each clickable or editable element receives a unique, sequential numeric tag.
Distinctive, colored bounding rectangles delineate element perimeters.
Numeric identifiers are positioned near or within the element for straightforward reference.
These numbers must be cited when specifying targets for click or type operations.
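
To make the labeling idea concrete, here is a minimal TypeScript sketch that assigns sequential numbers to interactive elements and looks one up by label. It is not the project's implementation: the PageElement shape, the list of interactive tags, the visibility check, and the choice to start numbering at 1 are all assumptions made for illustration.

```typescript
// Hypothetical description of an element found on the page.
interface PageElement {
  tag: string; // e.g. "a", "button", "input"
  x: number; // bounding box, in CSS pixels
  y: number;
  width: number;
  height: number;
}

interface LabeledElement extends PageElement {
  label: number;
}

// Assumed set of clickable or editable tags.
const INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea"];

// Keep visible interactive elements and number them in document order.
function assignLabels(elements: PageElement[]): LabeledElement[] {
  return elements
    .filter((el) => INTERACTIVE_TAGS.indexOf(el.tag) !== -1 && el.width > 0 && el.height > 0)
    .map((el, index) => ({ ...el, label: index + 1 }));
}

// Resolve the label cited in a click or type call back to its element.
function findByLabel(labeled: LabeledElement[], label: number): LabeledElement | undefined {
  return labeled.filter((el) => el.label === label)[0];
}

const labeledElements = assignLabels([
  { tag: "div", x: 0, y: 0, width: 800, height: 600 },
  { tag: "input", x: 20, y: 40, width: 200, height: 24 },
  { tag: "button", x: 230, y: 40, width: 80, height: 24 },
  { tag: "a", x: 0, y: 0, width: 0, height: 0 }, // zero-sized, so skipped
]);
```

In this example the input receives label 1 and the button label 2, so a click call with label 2 would target the button.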

Operational Configuration

Steel Voyager operates in one of two states: "Local" or "Cloud." This selection is governed by environment parameters. Below is a summary of critical environment controls:

Variable	Default Setting	Purpose and Behavior
`STEEL_LOCAL`	"false"	Toggles execution mode: `true` for local/self-hosted connection, `false` for cloud connection.
`STEEL_API_KEY`	(None)	Mandatory only if `STEEL_LOCAL` is `"false"`. Required for authenticating against the Steel endpoint.
`STEEL_BASE_URL`	"https://api.steel.dev"	The root URI for the Steel API. Change this if connecting to a non-standard Steel instance. If `STEEL_LOCAL` is `"true"` and this is unset, it defaults to `"http://localhost:3000"`.
`GLOBAL_WAIT_SECONDS`	(None)	Optional. Defines a uniform delay (in seconds) applied post-action to accommodate slow page rendering or dynamic updates.
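
The defaults in this table can be read as a small resolution procedure. The TypeScript sketch below follows the table as written; it is not the server's actual startup code, the names SteelConfig and resolveSteelConfig are hypothetical, and treating an unset GLOBAL_WAIT_SECONDS as 0 is an assumption.

```typescript
interface SteelConfig {
  mode: "local" | "cloud";
  baseUrl: string;
  apiKey: string | undefined;
  globalWaitSeconds: number;
}

type Env = Record<string, string | undefined>;

// Resolve settings from environment variables, applying the documented defaults.
function resolveSteelConfig(env: Env): SteelConfig {
  // STEEL_LOCAL defaults to "false"; only the exact string "true" selects local mode here.
  const isLocal = env["STEEL_LOCAL"] === "true";

  // The base URL default depends on the mode.
  const baseUrl =
    env["STEEL_BASE_URL"] ?? (isLocal ? "http://localhost:3000" : "https://api.steel.dev");

  // The API key is mandatory only in cloud mode.
  const apiKey = env["STEEL_API_KEY"];
  if (!isLocal && !apiKey) {
    throw new Error('STEEL_API_KEY is required when STEEL_LOCAL is "false"');
  }

  // Assumption: no post-action delay when GLOBAL_WAIT_SECONDS is unset.
  const globalWaitSeconds = Number(env["GLOBAL_WAIT_SECONDS"] ?? "0");
  if (isNaN(globalWaitSeconds) || globalWaitSeconds < 0) {
    throw new Error("GLOBAL_WAIT_SECONDS must be a non-negative number");
  }

  return {
    mode: isLocal ? "local" : "cloud",
    baseUrl,
    apiKey: isLocal ? undefined : apiKey,
    globalWaitSeconds,
  };
}
```

In a Node.js process this could be called as resolveSteelConfig(process.env); passing the environment in as a parameter keeps the function easy to test.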

Local Mode Configuration

Set STEEL_LOCAL="true".
Optionally, specify a custom STEEL_BASE_URL if your Steel server is not on the default local port.
Authentication key is irrelevant in this mode.
Puppeteer establishes connection via the specified local WebSocket endpoint (e.g., ws://0.0.0.0:3000).

Example:

export STEEL_LOCAL="true"

export STEEL_BASE_URL="http://localhost:3000" # Only if deviating from default
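
STEEL_BASE_URL is an HTTP address, while Puppeteer connects over a WebSocket. How Steel Voyager derives one from the other is not documented here; the TypeScript sketch below shows only the generic scheme swap (http to ws, https to wss) as an illustration, with toWebSocketUrl being a hypothetical helper rather than a function from this project.

```typescript
// Map an http(s) URL to the ws(s) URL with the same host, port, and path.
// Assumption: the WebSocket endpoint lives at the same address as the HTTP API.
function toWebSocketUrl(baseUrl: string): string {
  const match = /^(https?):\/\/(.+)$/.exec(baseUrl);
  if (match === null) {
    throw new Error(`expected an http(s) URL, got: ${baseUrl}`);
  }
  const scheme = match[1] === "https" ? "wss" : "ws";
  return `${scheme}://${match[2]}`;
}
```

With the default local setting, toWebSocketUrl("http://localhost:3000") gives ws://localhost:3000.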

Cloud Mode Configuration

Set STEEL_LOCAL="false".
Provide the STEEL_API_KEY to authenticate with the primary Steel cloud infrastructure.
STEEL_BASE_URL defaults to https://api.steel.dev; override for private cloud deployments.
Connection uses the secure WebSocket channel pointing to the cloud gateway.

Example:

export STEEL_LOCAL="false"

export STEEL_API_KEY="YOUR_STEEL_API_KEY_HERE"

Claude Desktop Integration Snippet

To register the server with Claude Desktop, add a configuration similar to the following to its configuration file (typically found at ~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "steel-puppeteer": {
      "command": "node",
      "args": ["path/to/steel-mcp-server/dist/index.js"],
      "env": {
        "STEEL_LOCAL": "false",
        "STEEL_API_KEY": "your_api_key_here"
      }
    }
  }
}

Adjust the environment settings within this block to align with your chosen operational context (Local vs. Cloud). This ensures Claude Desktop launches Steel Voyager correctly configured.
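
The difference between the two contexts comes down to the env block. As a sketch, the TypeScript below generates the steel-puppeteer entry for either mode; buildSteelServerEntry and the SteelMode type are hypothetical helpers for illustration, and the entry-point path is a placeholder.

```typescript
interface McpServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

// Cloud mode needs an API key; local mode optionally overrides the base URL.
type SteelMode =
  | { kind: "cloud"; apiKey: string }
  | { kind: "local"; baseUrl?: string };

function buildSteelServerEntry(
  entryPoint: string,
  mode: SteelMode,
  waitSeconds?: number,
): McpServerEntry {
  const env: Record<string, string> =
    mode.kind === "cloud"
      ? { STEEL_LOCAL: "false", STEEL_API_KEY: mode.apiKey }
      : { STEEL_LOCAL: "true", STEEL_BASE_URL: mode.baseUrl ?? "http://localhost:3000" };

  // Environment values in the config file are strings, so the delay is stringified.
  if (waitSeconds !== undefined) {
    env["GLOBAL_WAIT_SECONDS"] = String(waitSeconds);
  }
  return { command: "node", args: [entryPoint], env };
}

const localEntry = buildSteelServerEntry(
  "path/to/steel-mcp-server/dist/index.js", // placeholder path
  { kind: "local" },
  1,
);

// JSON text suitable for the mcpServers section of the config file.
const configJson = JSON.stringify({ mcpServers: { "steel-puppeteer": localEntry } }, null, 2);
```

Passing { kind: "cloud", apiKey: "..." } instead produces the cloud variant with STEEL_LOCAL set to "false" and no STEEL_BASE_URL override.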

📥 Installation & Execution

Automated Installation via Smithery

Deploy Steel MCP Server to Claude Desktop automatically using the Smithery CLI:

npx -y @smithery/cli install @steel-dev/steel-mcp-server --client claude

Source Code Development Workflow

Fork and clone the official repository.
Install the necessary Node.js dependencies: npm install
Compile the TypeScript source files: npm run build
Start the server process: npm start

Demo Showcase 📹

We challenged Claude to demonstrate its enhanced web interaction capabilities by investigating the newest advancements in the Sora model, subsequently visualizing the underlying mechanics and data structures.

https://github.com/user-attachments/assets/8d4293ea-03fc-459f-ba6b-291f5b017ad7

(Note: Video quality is reduced due to GitHub's file size restrictions.)

🛑 Troubleshooting Common Pitfalls

Connectivity/Authentication: Double-check that the Steel API key is valid for cloud usage, or confirm your local Steel service is active and accessible. Validate network paths.
Rendering Delays: If visual feedback seems misaligned or elements are missed, consider increasing the delay using the GLOBAL_WAIT_SECONDS environment variable in your client configuration.
Visual Integrity: Ensure the viewed webpage has completely finished its loading lifecycle and verify that the client viewport dimensions are adequate for capturing the required screen area.
Resource Management: The current implementation does not feature robust automatic session termination; manual cleanup of created browser instances may occasionally be necessary.
Prompt Engineering: Effectiveness is highly correlated with clear instruction design provided to the model.
Debugging: Use the associated session viewer utility to trace where the model's execution flow might be interrupted.
Performance Degradation: Expect increased latency after approximately 15-20 actions. This is often attributed to the accumulation of visual artifacts (screenshots) within the model's active context window, especially noticeable when interacting via the Claude Desktop client.

Collaboration

This implementation remains in an experimental phase and welcomes community contributions. To participate:

Fork the repository.
Establish a dedicated branch for your feature or fix.
Submit a comprehensive Pull Request, including:
- A detailed explanation of the modification.
- Rationale for the change.
- Necessary documentation updates.

Legal Notice

⚠️ Experimental Status: This software is derived from the Web Voyager project. Deployment in mission-critical production environments is undertaken at the user's sole discretion.

WIKIPEDIA INSIGHT: A headless browser operates without a graphical user interface, allowing automated script control over web page rendering. This capability is crucial for testing, as it processes HTML, CSS, and JavaScript identically to a standard browser, but through a command-line or network interface. Modern browser versions (Chrome 59+, Firefox 56+) natively support remote control, largely superseding legacy tools like PhantomJS.

== Primary Applications ==

The core uses for headless browser technology involve:

Automated testing frameworks for contemporary web applications (Web Testing).
Generating static image captures of dynamic web content.
Executing JavaScript library validation routines.
Programmatic manipulation of web page states.

=== Secondary Utility ===

Headless browsers are also valuable for structured data acquisition (web scraping). Google supported their use for indexing sites relying heavily on Ajax. Conversely, misuse includes launching distributed denial-of-service (DDoS) attacks, inflating ad impressions, or executing unauthorized automated site interactions (e.g., credential stuffing). However, broader traffic analysis suggests headless browsers are not disproportionately favored by malicious agents compared to conventional browsers for common attack vectors.

== Implementation Standards ==

As major browser vendors now offer native headless APIs, several abstraction layers exist to unify interaction across different engines:

Selenium WebDriver – Adheres to the W3C WebDriver standard.
Playwright – A versatile library supporting Chromium, Firefox, and WebKit automation.
Puppeteer – Focused primarily on automating Chrome or Firefox.

=== Test Orchestration Examples ===

Various testing frameworks incorporate headless browsing capabilities into their execution environments:

Capybara integrates headless browsing (via WebKit or Headless Chrome) to simulate user actions.
Jasmine frequently leverages Selenium, which can interface with headless environments.
Cypress, a specialized frontend testing utility.
QF-Test, which supports headless mode for GUI testing automation.

=== Alternatives to Full Rendering ===

An alternative strategy involves using software that exposes browser-like APIs directly. For example, Deno incorporates browser APIs into its core design. In the Node.js ecosystem, jsdom provides the most comprehensive simulation. While these tools handle many browser features (parsing, XHR, basic JS), they typically lack full DOM rendering and event simulation, leading to faster execution times than a true headless instance.