web-content-retriever-mcp

This project implements an MCP-compliant server that lets AI assistants retrieve content from external web resources.

English | Chinese Documentation

Architectural Blueprint

fetch-mcp/
├── src/                                # Source code directory
│   ├── lib/                            # Library files
│   │   ├── fetchers/                   # Resource acquisition modules
│   │   │   ├── browser/                # Headless browser operations
│   │   │   │   ├── BrowserFetcher.ts   # Browser acquisition logic
│   │   │   │   ├── BrowserInstance.ts  # Browser lifecycle oversight
│   │   │   │   └── PageOperations.ts   # DOM interaction routines
│   │   │   ├── node/                   # Standard Node.js acquisition
│   │   │   └── common/                 # Shared acquisition utilities
│   │   ├── utils/                      # Auxiliary functional units
│   │   │   ├── ChunkManager.ts         # Data segmentation utility
│   │   │   ├── ContentProcessor.ts     # Transformation of HTML to textual format
│   │   │   ├── ContentExtractor.ts     # Algorithmic core content parsing
│   │   │   ├── ContentSizeManager.ts   # Constraints management for payloads
│   │   │   └── ErrorHandler.ts         # Exception management framework
│   │   ├── server/                     # Server operational modules
│   │   │   ├── index.ts                # Server initialization point
│   │   │   ├── browser.ts              # Browser orchestration
│   │   │   ├── fetcher.ts              # Core acquisition orchestration
│   │   │   ├── tools.ts                # Tool registration and command handling
│   │   │   ├── resources.ts            # Managed asset handling
│   │   │   ├── prompts.ts              # Predefined interaction templates
│   │   │   └── types.ts                # Server type definitions contract
│   │   ├── i18n/                       # Multilingual support implementation
│   │   └── types.ts                    # Global type definitions
│   ├── client.ts                       # MCP client wrapper implementation
│   └── mcp-server.ts                   # Primary MCP server executable module
├── index.ts                            # Server execution startup file
├── tests/                              # Quality assurance routines
└── dist/                               # Compiled output assets

MCP Communication Contract

The Model Context Protocol (MCP) defines two primary transport mechanisms:

Standard I/O (Stdio): The client launches the MCP server as a child process and communicates with it over standard input (stdin) and standard output (stdout).
Server-Sent Events (SSE): Used for asynchronous message exchange between the client and the server.

This project uses the Stdio transport exclusively.
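As a rough illustration of what travels over the Stdio transport, the sketch below frames JSON-RPC 2.0 messages as newline-delimited JSON, which is how MCP's stdio transport separates messages. The names `encodeMessage` and `LineDecoder` are hypothetical and are not part of this project or the MCP SDK; in practice the SDK's transport classes take care of framing.

```typescript
// Hypothetical helpers illustrating newline-delimited JSON-RPC framing.
// They are NOT part of this project or of the MCP SDK.

interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Serialize one message as a single line terminated by "\n".
// JSON.stringify escapes newlines inside strings, so a message never spans lines.
function encodeMessage(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Accumulates raw text from a stream and returns complete messages as they arrive.
class LineDecoder {
  private buffer = "";

  push(chunk: string): JsonRpcRequest[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an unfinished line (or ""); keep it for the next push.
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as JsonRpcRequest);
  }
}

const request: JsonRpcRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "fetch_html", arguments: { url: "https://example.com" } },
};

const wire = encodeMessage(request);
const decoder = new LineDecoder();
// Simulate the message arriving in two partial reads.
const firstHalf = decoder.push(wire.slice(0, 20));
const secondHalf = decoder.push(wire.slice(20));
```

Because stdout carries only these framed messages, anything else written to stdout would break parsing, which is why the server sends its diagnostics to stderr.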

Key Capabilities

Built on the official MCP SDK.
Native support for the Stdio transport.
Multiple fetch methods (HTML, JSON, raw text, Markdown, and plain text converted from HTML).
Adaptive mode switching: automatic switching between direct HTTP requests and full browser mode.
Content chunking: automatically splits large responses into smaller segments that fit within language model context windows.
Chunked retrieval: specific segments can be requested while preserving context continuity.
Detailed debug logging written to standard error (stderr).
Bilingual support (English and Simplified Chinese).
Modular architecture for maintainability and extensibility.
Content extraction: uses the Mozilla Readability engine to isolate the main content, removing extraneous elements such as advertisements and navigation boilerplate.
Metadata extraction: retrieves page data including title, author, publication date, and source site.
Content quality detection: automatically distinguishes pages with substantive content from transient pages (e.g., login forms, error messages).
Enhanced browser interaction: programmatic scrolling, cookie management, and waiting for specific DOM elements before capture.

Deployment Instructions

Smithery Integration

For automated deployment and integration with Claude Desktop via Smithery:

npx -y @smithery/cli install @lmcc-dev/mult-fetch-mcp-server --client claude

Local Setup

pnpm install

Global Availability

pnpm add -g @lmcc-dev/mult-fetch-mcp-server

Alternatively, execute directly without persistent installation via npx:

npx @lmcc-dev/mult-fetch-mcp-server

Integrating with Claude AI

To use this server in Claude Desktop, add it to the Claude Desktop configuration file:

Configuration File Location

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%/Claude/claude_desktop_config.json
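For scripts that need to locate this file, the two locations above can be expressed as a small helper. This is only a sketch: `claudeConfigPath` is a hypothetical name, and the platform and environment values are passed in explicitly (rather than read from `process`) so the logic is easy to test.

```typescript
// Hypothetical helper that builds the Claude Desktop config path listed above.
type Platform = "darwin" | "win32";

interface PathEnv {
  HOME?: string;    // used on macOS to expand "~"
  APPDATA?: string; // used on Windows to expand "%APPDATA%"
}

function claudeConfigPath(platform: Platform, env: PathEnv): string {
  const file = "claude_desktop_config.json";
  if (platform === "darwin") {
    if (!env.HOME) throw new Error("HOME is not set");
    return `${env.HOME}/Library/Application Support/Claude/${file}`;
  }
  if (!env.APPDATA) throw new Error("APPDATA is not set");
  return `${env.APPDATA}/Claude/${file}`;
}

const macPath = claudeConfigPath("darwin", { HOME: "/Users/alice" });
const winPath = claudeConfigPath("win32", { APPDATA: "C:/Users/alice/AppData/Roaming" });
```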

Configuration Examples

Option 1: Via npx (Recommended Path)

This is the most straightforward method, suitable for globally available executables or direct npx invocation:

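{
  "mcpServers": {
    "web-content-retriever-mcp": {
      "command": "npx",
      "args": ["@lmcc-dev/mult-fetch-mcp-server"],
      "env": {
        "MCP_LANG": "en"
      }
    }
  }
}

Set MCP_LANG to "zh" or "en" to select the locale. Standard JSON does not permit comments, so keep the configuration file free of them.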

Option 2: Explicit Binary Path

Use this when a precise installation location must be referenced:

{
  "mcpServers": {
    "web-content-retriever-mcp": {
      "command": "path-to/bin/node",
      "args": ["path-to/@lmcc-dev/mult-fetch-mcp-server/dist/index.js"],
      "env": {
        "MCP_LANG": "en"
      }
    }
  }
}

Set MCP_LANG to "zh" or "en" to select the locale. Standard JSON does not permit comments, so keep the configuration file free of them.

Note: Substitute the placeholder paths with your system's actual Node.js executable and project directory locations.

Once configured, restart the Claude application. The following tools become callable within your conversational context:

fetch_html: Retrieves the raw HTML structure.
fetch_json: Acquires structured JSON data.
fetch_txt: Fetches unformatted textual output.
fetch_markdown: Retrieves content formatted using Markdown syntax.
fetch_plaintext: Delivers HTML content stripped entirely of markup tags.

Compilation Process

pnpm run build

Executing the Server

pnpm run server

Alternatively, using the compiled output:

node dist/index.js

If installed globally:

@lmcc-dev/mult-fetch-mcp-server

Or via npx:

npx @lmcc-dev/mult-fetch-mcp-server

Client Demonstration Utilities

Important: The client.js script is intended only for validation and testing. In real use with Claude or similar systems, the AI client manages chunked retrieval on its own.

CLI Testing Interface

Use this interface for development checks:

pnpm run client

Example invocation:

pnpm run client fetch_html '{"url": "https://example.com", "debug": true}'

Testing Segment Control Parameters

These arguments assist in simulating large file handling via the CLI:

--all-chunks: Flag to mandate sequential retrieval of every generated data segment.
--max-chunks: Sets an upper bound on the number of segments to process (default limit is 10).

Real-time Payload Streaming Demonstration

This demo client showcases immediate output as segments are processed:

node dist/src/client.js fetch_html '{"url":"https://example.com", "startCursor": 0, "contentSizeLimit": 500}' --all-chunks --debug

The client fetches every segment in sequence and prints each one as soon as it arrives, demonstrating real-time handling of large content.

Running Automated Checks

Run the core MCP interaction tests:

npm run test:mcp

Run the tests targeting the mini4k.com benchmark set:

npm run test:mini4k

Run the direct client interface validation tests:

npm run test:direct

Locale Configuration

This utility supports both Chinese and English interfaces. Language selection is managed via environment variables:

Environment Variable Control

Set the MCP_LANG variable to dictate the operational locale:

Activate English mode:

export MCP_LANG=en
npm run server

Activate Chinese mode:

export MCP_LANG=zh
npm run server

Windows Command Shell:

set MCP_LANG=zh
npm run server

Using environment variables ensures consistent language settings across all server subprocesses.

Default Locale Selection Logic

The default language is chosen in the following order:

1. Explicit setting of the MCP_LANG environment variable.
2. Inspection of the host OS language setting (if it begins with "zh", Chinese is selected).
3. English is the final fallback.
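The selection order above can be sketched as a small pure function. `resolveLocale` is a hypothetical name, not the project's actual i18n code, and treating an unrecognized MCP_LANG value as "not set" is an assumption made for this example.

```typescript
type Locale = "en" | "zh";

// Hypothetical sketch of the locale hierarchy; the real i18n module may differ.
function resolveLocale(mcpLang: string | undefined, systemLocale: string | undefined): Locale {
  // 1. An explicit, recognized MCP_LANG value wins.
  //    (Assumption: unrecognized values fall through to the next step.)
  if (mcpLang === "en" || mcpLang === "zh") return mcpLang;
  // 2. Otherwise inspect the OS locale, e.g. "zh_CN.UTF-8" or "zh-CN".
  if (systemLocale !== undefined && systemLocale.toLowerCase().startsWith("zh")) return "zh";
  // 3. English is the final fallback.
  return "en";
}

const explicit = resolveLocale("en", "zh_CN.UTF-8");
const fromSystem = resolveLocale(undefined, "zh_CN.UTF-8");
const fallback = resolveLocale(undefined, "fr_FR.UTF-8");
```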

Diagnostic Output Management

In compliance with MCP standards, no operational logs are output by default to prevent corruption of the JSON-RPC stream. Diagnostics are explicitly enabled via request parameters:

Invoking the `debug` Parameter

Include "debug": true within any tool argument payload:

{ "url": "https://example.com", "debug": true }

Debug output is streamed to standard error (stderr) prefixed for source identification:

[MCP-SERVER] MCP server starting...
[CLIENT] Fetching URL: https://example.com
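A minimal sketch of this logging pattern follows. The `formatLog` and `debugLog` names are hypothetical and only mirror the prefix format shown above; the project's own logger may be structured differently.

```typescript
// Hypothetical logger sketch; only the "[COMPONENT] message" format comes from the examples above.
function formatLog(component: string, message: string): string {
  return `[${component}] ${message}`;
}

// Writes a prefixed line to stderr only when debugging is enabled for the request.
// Returns the line that was written, or undefined when nothing was logged.
function debugLog(component: string, message: string, debug: boolean): string | undefined {
  if (!debug) return undefined; // silent by default
  const line = formatLog(component, message);
  console.error(line); // in Node.js, console.error writes to stderr, never stdout
  return line;
}

const silent = debugLog("MCP-SERVER", "MCP server starting...", false);
const logged = debugLog("CLIENT", "Fetching URL: https://example.com", true);
```

Keeping diagnostics on stderr leaves stdout free for protocol messages.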

Persistent Debug Log File

When debugging is active, a consolidated log is also written to:

~/.mult-fetch-mcp-server/debug.log

This file is accessible via the MCP resources API:

// Fetch the contents of the debug log
const result = await client.readResource({ uri: "file:///logs/debug" });
console.log(result.contents[0].text);

// Clear the debug log
const clearResult = await client.readResource({ uri: "file:///logs/clear" });
console.log(clearResult.contents[0].text);

Proxy Configuration Strategy

This utility supports proxy configuration through several tiered mechanisms:

1. Direct Request Parameter Injection

Specify the proxy directly within the call arguments:

{ "url": "https://example.com", "proxy": "http://your-proxy-server:port", "debug": true }

2. Environment Variable Monitoring

The service automatically inspects and utilizes standard environment variables for proxy discovery:

export HTTP_PROXY=http://your-proxy-server:port
export HTTPS_PROXY=http://your-proxy-server:port

Then start the server:

npm run server

3. Host System Proxy Discovery

The tool attempts to query operating system-level proxy configurations:

Windows: Reads from environment settings via command utility.
macOS/Linux: Reads from environment settings via command utility.

4. Proxy Resolution Failure Protocol

If proxy integration proves difficult:

Activate debug: true to scrutinize the proxy detection logs.
Manually override using the proxy parameter.
Verify the proxy URL adheres to the required format (http://host:port or https://host:port).
For operations demanding browser execution, set useBrowser: true.
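When checking the proxy URL format by hand is error-prone, a small validator can help. The sketch below uses the standard `URL` class; `isValidProxyUrl` is a hypothetical helper, not a function exported by this project, and it only checks the scheme and host.

```typescript
// Hypothetical validator for the proxy formats listed above
// (http://host:port or https://host:port).
function isValidProxyUrl(value: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(value);
  } catch {
    return false; // not parseable as a URL at all
  }
  const schemeOk = parsed.protocol === "http:" || parsed.protocol === "https:";
  return schemeOk && parsed.hostname.length > 0;
}

const okProxy = isValidProxyUrl("http://127.0.0.1:7890");
const wrongScheme = isValidProxyUrl("socks5://127.0.0.1:1080");
const notAUrl = isValidProxyUrl("not a url");
```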

5. Browser Context Proxy Handling

When browser automation is invoked (useBrowser: true), the proxy resolution sequence is:

Explicitly provided proxy parameter.
Detected system-wide proxy settings.
Defaulting to no proxy if the above fail.

Browser mode is advantageous for navigating sites employing robust anti-scraping countermeasures or those requiring dynamic JavaScript interpretation.

Parameter Inheritance Logic

debug: Controlled on a per-request basis via the tool call arguments.
MCP_LANG: Inferred from environment variables; governs the global localization state of the server instance.

Operational Usage

Instantiating a Client Connector

import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
import path from 'path';
import { fileURLToPath } from 'url';

// Determine script directory
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Establish the communication pipe
const transport = new StdioClientTransport({
  command: 'node',
  args: [path.resolve(__dirname, 'dist/index.js')],
  stderr: 'inherit',
  env: {
    ...process.env // Forward all current environment variables
  }
});

// Initialize the client
const client = new Client({
  name: "example-client",
  version: "1.0.0"
});

// Link client to the transport layer
await client.connect(transport);

// Execute a tool call
const result = await client.callTool({
  name: 'fetch_html',
  arguments: {
    url: 'https://example.com',
    debug: true // Enable debug output for this request
  }
});

if (result.isError) {
  console.error('Acquisition Failure:', result.content[0].text);
} else {
  console.log('Acquisition Success!');
  console.log('Payload snippet:', result.content[0].text.substring(0, 500));
}

Available Toolset

fetch_html: Retrieve the Document Object Model (DOM) content.
fetch_json: Retrieve data encoded in JSON format.
fetch_txt: Retrieve raw, unformatted textual data.
fetch_markdown: Retrieve content conforming to Markdown specifications.
fetch_plaintext: Retrieve content sanitized of all HTML markup.

Resource Management Capabilities

The server supports the MCP resource listing (resources/list) and reading (resources/read) endpoints. While the framework is present for accessing project documentation and assets, this specific functionality remains undeveloped.

Resource Interaction Example

// Query for accessible managed assets
const resourcesResult = await client.listResources({});
console.log('Managed assets inventory:', resourcesResult);

// Note: This currently returns empty sets as no resources are defined.

Supported Prompt Templates

The system exposes several predefined interaction templates designed to guide complex requests:

fetch-website: A generic handler for retrieving site data across various output modalities and controlling browser activation.
extract-content: Optimized for isolating specific components via CSS selectors and enforcing data type contracts.
debug-fetch: A specialized template for diagnosing acquisition failures and suggesting remedies.

Prompt Template Utilization

Use prompts/list to survey the available template catalog.
Use prompts/get to retrieve the source text for a specific template.

// Check available prompt templates
const promptsResult = await client.listPrompts({});
console.log('Available Prompts:', promptsResult);

// Example: Requesting the website fetching prompt with parameters
const fetchPrompt = await client.getPrompt({
  name: "fetch-website",
  arguments: {
    url: "https://example.com",
    format: "html",
    useBrowser: "false"
  }
});
console.log('Fetch Website Template:', fetchPrompt);

// Example: Generating a diagnostic prompt
const debugPrompt = await client.getPrompt({
  name: "debug-fetch",
  arguments: {
    url: "https://example.com",
    error: "Connection timeout"
  }
});
console.log('Diagnostic Template:', debugPrompt);

Comprehensive Parameter Definitions

Every tool accepts the following arguments for fine-grained control:

Fundamental Settings

url: The Uniform Resource Locator to target (Mandatory).
headers: Custom HTTP request metadata (Optional, default: {}).
proxy: A string defining the proxy endpoint (http://host:port or https://host:port) (Optional).

Network Flow Controls

timeout: Maximum allowed duration for the operation, in milliseconds (Optional, default: 30000).
maxRedirects: Limit for URL redirection traversal (Optional, default: 10).
noDelay: Boolean flag to bypass randomized request spacing (Optional, default: false).
useSystemProxy: Boolean flag to engage system-level proxy lookups (Optional, default: true).

Payload Sizing & Segmentation

enableContentSplitting: Boolean to activate data segmentation for large payloads (Optional, default: true).
contentSizeLimit: Byte threshold triggering data fragmentation (Optional, default: 50000).
startCursor: Byte offset indicating where to resume data retrieval in segmented content (Optional, default: 0).

These parameters facilitate the management of substantial web data volumes that might exceed the operational buffer of the target AI architecture, enabling retrieval in manageable, contextually continuous blocks.
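To make the interplay of contentSizeLimit and startCursor concrete, here is a minimal sketch of cursor-based slicing. `sliceContent` and the `Chunk` shape (including `nextCursor` and `totalSize`) are hypothetical and not the project's ChunkManager API. For simplicity the sketch counts string code units rather than bytes, so offsets would differ from the real byte-based behavior for non-ASCII text.

```typescript
// Hypothetical chunk shape; field names other than startCursor are assumptions.
interface Chunk {
  content: string;
  startCursor: number;
  nextCursor: number | null; // null when no content remains
  totalSize: number;
}

// Returns the segment beginning at startCursor, at most contentSizeLimit units long.
// Note: this sketch measures string code units, not bytes.
function sliceContent(full: string, startCursor: number, contentSizeLimit: number): Chunk {
  if (contentSizeLimit <= 0) throw new Error("contentSizeLimit must be positive");
  const start = Math.min(Math.max(startCursor, 0), full.length);
  const end = Math.min(start + contentSizeLimit, full.length);
  return {
    content: full.slice(start, end),
    startCursor: start,
    nextCursor: end < full.length ? end : null,
    totalSize: full.length,
  };
}

const page = "a".repeat(1200);
const first = sliceContent(page, 0, 500);
const second = sliceContent(page, first.nextCursor ?? 0, 500);
const last = sliceContent(page, second.nextCursor ?? 0, 500);
```

Each call depends only on the cursor, so a caller can resume from any position without the server tracking per-client progress.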

Segment Addressing

chunkId: A unique identifier assigned to a segmented data set, used subsequently to request the next partition.

When content is split, the response includes the chunkId and the next starting position. The AI uses these to request the remaining segments in order, and byte-level addressing ensures that no content is skipped or duplicated.
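The retrieval loop an AI client performs might look like the sketch below. Everything here is illustrative: `ChunkResponse`, its `nextCursor` field, and `fetchAllChunks` are hypothetical names, and the fetch function is a synchronous stand-in so the example runs without a server. A real client would await `client.callTool` instead.

```typescript
// Hypothetical response shape; only chunkId and startCursor are named in this document.
interface ChunkResponse {
  text: string;
  chunkId: string;
  nextCursor: number | null; // null once the final segment has been returned
}

type FetchChunk = (args: { url: string; startCursor: number; chunkId?: string }) => ChunkResponse;

// Requests segments in order until none remain or maxChunks is reached.
function fetchAllChunks(fetchChunk: FetchChunk, url: string, maxChunks = 10): string[] {
  const parts: string[] = [];
  let cursor: number | null = 0;
  let chunkId: string | undefined;
  while (cursor !== null && parts.length < maxChunks) {
    const response = fetchChunk({ url, startCursor: cursor, chunkId });
    parts.push(response.text);
    chunkId = response.chunkId; // reuse the identifier for the next request
    cursor = response.nextCursor;
  }
  return parts;
}

// Stand-in for a real tool call: serves a 30-character document in 12-character segments.
const source = "0123456789".repeat(3);
const fakeFetch: FetchChunk = ({ startCursor }) => {
  const end = Math.min(startCursor + 12, source.length);
  return {
    text: source.slice(startCursor, end),
    chunkId: "demo-chunk",
    nextCursor: end < source.length ? end : null,
  };
};

const parts = fetchAllChunks(fakeFetch, "https://example.com");
```

The `maxChunks` bound mirrors the `--max-chunks` option of the test client and guards against unbounded loops on very large pages.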

Mode Selection Parameters

useBrowser: Boolean to force the execution environment to be a headless browser (Optional, default: false).
useNodeFetch: Boolean to strictly enforce the standard Node.js acquisition method (Optional, default: false; overrides useBrowser).
autoDetectMode: Boolean to enable heuristic switching to browser mode upon standard request failure (e.g., HTTP 403) (Optional, default: true). Setting this to false enforces the explicitly chosen acquisition method.
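How these three flags interact can be summarized in two small decision functions. This is a sketch under stated assumptions: the function names are hypothetical, only HTTP 403 is treated as a fallback trigger (the document gives it as one example), and useNodeFetch is assumed to also disable the automatic fallback.

```typescript
type FetchMode = "node" | "browser";

interface ModeOptions {
  useBrowser?: boolean;     // default: false
  useNodeFetch?: boolean;   // default: false, overrides useBrowser
  autoDetectMode?: boolean; // default: true
}

// Picks the mode for the first attempt.
function initialMode(options: ModeOptions): FetchMode {
  if (options.useNodeFetch) return "node"; // strict Node.js mode wins over useBrowser
  return options.useBrowser ? "browser" : "node";
}

// Decides whether a failed standard request should be retried in browser mode.
// Assumption: only HTTP 403 triggers the switch in this sketch.
function shouldRetryInBrowser(options: ModeOptions, mode: FetchMode, httpStatus: number): boolean {
  const autoDetect = options.autoDetectMode ?? true;
  if (!autoDetect || options.useNodeFetch || mode === "browser") return false;
  return httpStatus === 403;
}

const forcedNode = initialMode({ useBrowser: true, useNodeFetch: true });
const defaultMode = initialMode({});
const retry = shouldRetryInBrowser({}, "node", 403);
const noRetry = shouldRetryInBrowser({ autoDetectMode: false }, "node", 403);
```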

Browser Automation Specifics

waitForSelector: A CSS selector the browser must resolve before capture (Optional, default: 'body').
waitForTimeout: Maximum wait time in milliseconds for the selector resolution (Optional, default: 5000).
scrollToBottom: Boolean flag to execute a full page scroll operation in browser mode (Optional, default: false).
saveCookies: Boolean to persist session cookies across browser interactions (Optional, default: true).
closeBrowser: Boolean to terminate the browser process immediately after the call, regardless of outcome (Optional, default: false).

Content Parsing Controls

extractContent: Boolean to invoke the Readability algorithm for core content extraction (Optional, default: false).
includeMetadata: Boolean to include structured metadata alongside extracted text (Optional, default: false; dependent on extractContent).
fallbackToOriginal: Boolean to revert to the raw content if the extraction heuristic fails (Optional, default: true; dependent on extractContent).
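The relationship between these three options can be sketched as follows. The extractor is injected as a plain function so the example stays self-contained; in the project it would be backed by Readability. `processContent` and the result shape are hypothetical, and throwing when extraction fails with fallbackToOriginal disabled is an assumption about error handling.

```typescript
interface Extracted {
  text: string;
  title?: string;
  byline?: string;
}

// Returns null when no main content could be identified.
type Extractor = (html: string) => Extracted | null;

interface ExtractOptions {
  extractContent?: boolean;     // default: false
  includeMetadata?: boolean;    // default: false
  fallbackToOriginal?: boolean; // default: true
}

interface ProcessResult {
  content: string;
  metadata?: { title?: string; byline?: string };
  usedFallback: boolean;
}

function processContent(html: string, extract: Extractor, options: ExtractOptions): ProcessResult {
  if (!options.extractContent) return { content: html, usedFallback: false };
  const extracted = extract(html);
  if (extracted === null) {
    // Extraction failed: return the original content unless the caller opted out.
    if (options.fallbackToOriginal ?? true) return { content: html, usedFallback: true };
    throw new Error("Content extraction failed and fallbackToOriginal is false");
  }
  const result: ProcessResult = { content: extracted.text, usedFallback: false };
  if (options.includeMetadata) {
    result.metadata = { title: extracted.title, byline: extracted.byline };
  }
  return result;
}

// Stub extractors standing in for a Readability-backed implementation.
const goodExtractor: Extractor = () => ({ text: "Main article text", title: "Example" });
const failingExtractor: Extractor = () => null;

const extractedResult = processContent("<html>...</html>", goodExtractor, { extractContent: true, includeMetadata: true });
const fallbackResult = processContent("<html>...</html>", failingExtractor, { extractContent: true });
```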

Diagnostic Flag

debug: Boolean to activate verbose diagnostic reporting (Optional, default: false).
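For reference, the parameters and defaults listed above can be collected into one typed object. This is an illustrative sketch, not the project's own type definitions: `DEFAULTS`, `FetchParams`, and `withDefaults` are hypothetical names, while the values are taken from the lists above.

```typescript
// Default values as documented in the parameter lists above.
const DEFAULTS = {
  headers: {} as Record<string, string>,
  timeout: 30000,
  maxRedirects: 10,
  noDelay: false,
  useSystemProxy: true,
  enableContentSplitting: true,
  contentSizeLimit: 50000,
  startCursor: 0,
  useBrowser: false,
  useNodeFetch: false,
  autoDetectMode: true,
  waitForSelector: "body",
  waitForTimeout: 5000,
  scrollToBottom: false,
  saveCookies: true,
  closeBrowser: false,
  extractContent: false,
  includeMetadata: false,
  fallbackToOriginal: true,
  debug: false,
};

type Defaults = typeof DEFAULTS;

// Hypothetical request type: url is mandatory, everything else is optional.
interface FetchParams extends Partial<Defaults> {
  url: string;
  proxy?: string;
  chunkId?: string;
}

interface ResolvedParams extends Defaults {
  url: string;
  proxy?: string;
  chunkId?: string;
}

// Fills in every omitted option with its documented default.
function withDefaults(params: FetchParams): ResolvedParams {
  return { ...DEFAULTS, ...params };
}

const minimal = withDefaults({ url: "https://example.com" });
const custom = withDefaults({ url: "https://example.com", timeout: 5000, useBrowser: true });
```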

Distillation Feature Utilization

Employ the content distillation feature to obtain the substantive core of a web page, discarding peripheral noise like navigation, advertisements, and ancillary sidebars:

{ "url": "https://example.com/article", "extractContent": true, "includeMetadata": true }

A successful result includes the following metadata, if present:

- Document Title
- Author Byline
- Source Website Name
- Summary Excerpt
- Content Byte Count
- Readability Status (isReaderable flag)

Edge Case Operations

To isolate and extract content from a structurally complicated page where extraction might otherwise fail, activate the safeguard:

{ "url": "https://example.com/complex-layout", "extractContent": true, "fallbackToOriginal": true }

To instigate a shutdown of the persistent browser session without executing any retrieval task:

{ "url": "about:blank", "closeBrowser": true }

Proxy Precedence Rules

Proxy settings are determined in the following order of priority:

1. Proxy string explicitly defined in the command invocation arguments.
2. The proxy argument provided within the tool's parameter JSON.
3. Environment variables (only if useSystemProxy is set to true).
4. System configuration retrieved via tools like Git (only if useSystemProxy is true).

Crucially, if the proxy parameter is present in the tool arguments, useSystemProxy is automatically set to false.
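The precedence rules can be sketched as a single resolution function. The `ProxySources` shape and `resolveProxy` name are hypothetical; each source is passed in as a plain value so the ordering is easy to see and test, whereas the real server discovers them itself.

```typescript
// Hypothetical container for the four proxy sources described above.
interface ProxySources {
  cliProxy?: string;        // 1. proxy given in the command invocation
  requestProxy?: string;    // 2. proxy parameter in the tool arguments
  envProxy?: string;        // 3. environment variables such as HTTPS_PROXY
  systemProxy?: string;     // 4. system configuration (e.g. via Git)
  useSystemProxy?: boolean; // default: true
}

function resolveProxy(sources: ProxySources): string | undefined {
  if (sources.cliProxy) return sources.cliProxy;
  // An explicit request proxy is used as-is; as stated above, it also
  // turns off system lookups, so steps 3 and 4 are never consulted.
  if (sources.requestProxy) return sources.requestProxy;
  const useSystem = sources.useSystemProxy ?? true;
  if (!useSystem) return undefined;
  return sources.envProxy ?? sources.systemProxy;
}

const fromRequest = resolveProxy({ requestProxy: "http://10.0.0.1:8080", envProxy: "http://10.0.0.2:8080" });
const fromEnv = resolveProxy({ envProxy: "http://10.0.0.2:8080", systemProxy: "http://10.0.0.3:8080" });
const disabled = resolveProxy({ envProxy: "http://10.0.0.2:8080", useSystemProxy: false });
```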

Diagnostic Output Channels

With debug: true enabled, output is routed to stderr and prefixed to identify the originating component:

- [MCP-SERVER]: Server lifecycle and protocol handling messages.
- [NODE-FETCH]: Logs originating from the standard Node.js HTTP client.
- [BROWSER-FETCH]: Logs generated by the headless browser runtime.
- [CLIENT]: Messages related to client-side command handling.
- [TOOLS]: Operational logs from tool execution layers.
- [FETCHER]: Logs from the unifying acquisition interface.
- [CONTENT]: Logs concerning raw data processing and handling.
- [CONTENT-PROCESSOR]: Logs specific to HTML-to-text transformation logic.
- [CONTENT-SIZE]: Logs detailing payload segmentation and size constraints enforcement.
- [CHUNK-MANAGER]: Logs detailing the logic of data partitioning and addressing.
- [ERROR-HANDLER]: Logs from the centralized exception management system.
- [BROWSER-MANAGER]: Logs governing the lifecycle and state of browser instances.
- [CONTENT-EXTRACTOR]: Logs related to the Readability algorithm and content filtering routines.

Licensing

MIT

Revision by lmcc-dev
