unified-web-data-retriever-mcp-service
A Model Context Protocol (MCP) service for fetching remote web content. It handles structured data (JSON) and hypertext (HTML), provides content extraction, and supports both English and Chinese.
Author

lmcc-dev
This service gives AI agents an MCP-compliant interface to external web data sources.
Component Architecture Overview
```
fetch-mcp/
├── src/                                # Source code root
│   ├── lib/                            # Core library modules
│   │   ├── fetchers/                   # Retrieval implementations
│   │   │   ├── browser/                # Headless browser utility set
│   │   │   │   ├── BrowserFetcher.ts   # Browser interaction logic
│   │   │   │   ├── BrowserInstance.ts  # Browser lifecycle management
│   │   │   │   └── PageOperations.ts   # In-page actions
│   │   │   ├── node/                   # Standard HTTP fetching (Node.js based)
│   │   │   └── common/                 # Shared transport utilities
│   │   ├── utils/                      # Auxiliary modules
│   │   │   ├── ChunkManager.ts         # Data segmentation utility
│   │   │   ├── ContentProcessor.ts     # Transformation from HTML to clean text
│   │   │   ├── ContentExtractor.ts     # Sophisticated data capture logic
│   │   │   ├── ContentSizeManager.ts   # Mechanisms for size governance
│   │   │   └── ErrorHandler.ts         # Exception management routines
│   │   ├── server/                     # Server-side framework components
│   │   │   ├── index.ts                # Server bootstrap
│   │   │   ├── browser.ts              # Browser orchestration
│   │   │   ├── fetcher.ts              # Unified fetching API
│   │   │   ├── tools.ts                # Tool registration handlers
│   │   │   ├── resources.ts            # Asset management interfaces
│   │   │   ├── prompts.ts              # Predefined instruction templates
│   │   │   └── types.ts                # Server-side type definitions
│   │   ├── i18n/                       # Language localization files
│   │   └── types.ts                    # Shared data structures
│   ├── client.ts                       # MCP client interface code
│   └── mcp-server.ts                   # Primary server execution file
├── index.ts                            # Application entry point
├── tests/                              # Verification scripts
└── dist/                               # Compiled output directory
```
MCP Protocol Implementation Details
The Model Context Protocol (MCP) defines two primary transports:
- Standard Input/Output (Stdio): In this mode, the consuming client initiates the MCP service as a subordinate process, with message exchange occurring via standard I/O streams (stdin/stdout).
- Server-Sent Events (SSE): Utilized for streaming messages between the initiator and the service.
This particular implementation exclusively utilizes the Stdio transport mechanism.
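On the stdio transport, each JSON-RPC message travels as a single line of JSON terminated by a newline. The sketch below illustrates that framing only; the helper names `encodeMessage` and `decodeMessages` are hypothetical and belong neither to this project nor to the MCP SDK.

```typescript
// Illustrative helpers (hypothetical names) for newline-delimited JSON-RPC framing.
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Serialize one message as a single line; JSON.stringify escapes newlines inside strings.
function encodeMessage(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Split received text into complete messages plus any trailing partial line.
function decodeMessages(buffer: string): { messages: JsonRpcRequest[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? "";
  const messages = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as JsonRpcRequest);
  return { messages, rest };
}

const wire = encodeMessage({
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "fetch_html", arguments: { url: "https://example.com" } },
});

// Simulate a read that ends in the middle of a second message.
const decoded = decodeMessages(wire + '{"jsonrpc":"2.0","id":2');
console.log(decoded.messages.length, decoded.rest);
```

Keeping the unfinished tail in `rest` lets a reader append the next read from stdout before parsing again. It also shows why ordinary logs must stay off stdout: any stray line there would be parsed as a message.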
Key Capabilities
- Adherence to the official MCP SDK specifications.
- Robust support for Stdio communication.
- Versatile web acquisition methods (handling HTML, JSON, raw text, Markdown, and clean text transformation).
- Adaptive Mode Selection: Intelligent toggling between lightweight requests and full browser simulation.
- Context Limitation Mitigation: Automatic segmentation of overly large retrieved documents into manageable segments for constrained AI context windows.
- Segmented Retrieval: Capacity to request arbitrary segments of previously segmented content, ensuring contextual integrity.
- Detailed diagnostics streamed to standard error (stderr).
- Full bilingual support (English and Mandarin Chinese).
- Modular codebase structure for simplified upkeep and expansion.
- Advanced Content Sifting: Uses Mozilla's Readability library to isolate the primary content and strip advertisements and navigational clutter.
- Meta-data Capture: Extraction of key webpage attributes, including document title, authorship, publishing timestamp, and source site identification.
- Substantial Content Validation: Automated checking to filter out non-substantive pages such as login portals or error screens.
- Browser Feature Augmentation: Capabilities covering page manipulation (scrolling), session cookie handling, explicit element waiting, and other sophisticated browser interactions.
Deployment
Acquisition via Smithery
For automated deployment of Unified Web Data Retriever MCP Service into Claude Desktop using Smithery:
```bash
npx -y @smithery/cli install @lmcc-dev/mult-fetch-mcp-server --client claude
```
Local Setup
```bash
pnpm install
```
Global Installation
```bash
pnpm add -g @lmcc-dev/mult-fetch-mcp-server
```
Alternatively, execute directly using npx without persistent installation:
```bash
npx @lmcc-dev/mult-fetch-mcp-server
```
Integration with Claude Desktop
To enable this service within the Claude desktop application, modify the server configuration file:
Configuration File Location
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%/Claude/claude_desktop_config.json
Configuration Blueprints
Method 1: Utilizing npx (Recommended)
This approach abstracts the need for absolute path specification and works well with global installations or direct npx invocation:
```json
{
  "mcpServers": {
    "unified-web-data-retriever-mcp-service": {
      "command": "npx",
      "args": ["@lmcc-dev/mult-fetch-mcp-server"],
      "env": {
        "MCP_LANG": "en"
      }
    }
  }
}
```
Set MCP_LANG to "zh" or "en" to choose the language. JSON does not permit comments, so keep the file free of them.
Method 2: Specifying Absolute Path
Use this if targeting a specific installation directory:
```json
{
  "mcpServers": {
    "unified-web-data-retriever-mcp-service": {
      "command": "path-to/bin/node",
      "args": ["path-to/@lmcc-dev/mult-fetch-mcp-server/dist/index.js"],
      "env": {
        "MCP_LANG": "en"
      }
    }
  }
}
```
Set MCP_LANG to "zh" or "en" to choose the language. JSON does not permit comments, so keep the file free of them.
Please substitute placeholder paths with your actual Node.js executable location and the project's root directory.
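Because the configuration file is plain JSON, one way to avoid syntax slips is to generate the entry from code and paste the output. The following is a minimal sketch for the npx method; `buildServerEntry` is a hypothetical helper, not something shipped with this package.

```typescript
// Hypothetical helper: builds one mcpServers entry for claude_desktop_config.json.
interface ServerEntry {
  command: string;
  args: string[];
  env: Record<string, string>;
}

function buildServerEntry(lang: "en" | "zh"): ServerEntry {
  return {
    command: "npx",
    args: ["@lmcc-dev/mult-fetch-mcp-server"],
    env: { MCP_LANG: lang },
  };
}

const config = {
  mcpServers: {
    "unified-web-data-retriever-mcp-service": buildServerEntry("en"),
  },
};

// JSON.stringify emits strict JSON, which cannot carry comments.
const configJson = JSON.stringify(config, null, 2);
console.log(configJson);
```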
Usage Demonstration
Upon successful configuration and restarting Claude, the following tools become available for interaction:
- fetch_html: Retrieves the raw hypertext structure.
- fetch_json: Obtains data formatted as JSON.
- fetch_txt: Fetches raw, unformatted text.
- fetch_markdown: Fetches content represented in Markdown syntax.
- fetch_plaintext: Retrieves HTML content stripped entirely of markup tags.
Compilation
```bash
pnpm run build
```
Executing the Service
```bash
pnpm run server
# Alternatively
node dist/index.js
# If globally installed
@lmcc-dev/mult-fetch-mcp-server
# Or via npx
npx @lmcc-dev/mult-fetch-mcp-server
```
Development Client Utilities
Note: The client.js utility is intended strictly for prototyping and validation. When integrated with Claude or similar systems, the AI orchestrates the service, handling chunking logic transparently.
Command Line Interface
```bash
pnpm run client
# Example invocation
pnpm run client fetch_html '{"url": "https://example.com", "debug": true}'
```
Demo Client Chunk Control Arguments
Parameters available when testing segmentation using the CLI client:
- --all-chunks: Flag to sequentially pull every generated segment (testing utility only).
- --max-chunks: Cap on the total number of segments retrieved (default limit set to 10).
Live Output Showcase
The client.js script supports real-time data streaming:
```bash
node dist/src/client.js fetch_html '{"url":"https://example.com", "startCursor": 0, "contentSizeLimit": 500}' --all-chunks --debug
```
This command demonstrates sequential fetching and immediate display of data segments, highlighting real-time large content processing.
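The sequential retrieval that `--all-chunks` and `--max-chunks` describe can be pictured as a simple loop. This sketch runs against a stand-in fetch function rather than the real service, and the response shape (`text`, `chunkId`, `nextCursor`) is an assumption made for illustration; the actual metadata field names may differ.

```typescript
// Hypothetical response shape; the real service's metadata fields may differ.
interface ChunkResponse {
  text: string;
  chunkId: string;
  nextCursor: number | null; // null when no content remains
}

type FetchChunk = (startCursor: number, chunkId?: string) => Promise<ChunkResponse>;

// Pull segments in order until content is exhausted or maxChunks is reached.
async function fetchAllChunks(fetchChunk: FetchChunk, maxChunks = 10): Promise<string[]> {
  const parts: string[] = [];
  let cursor: number | null = 0;
  let chunkId: string | undefined;
  while (cursor !== null && parts.length < maxChunks) {
    const response: ChunkResponse = await fetchChunk(cursor, chunkId);
    parts.push(response.text);
    chunkId = response.chunkId;
    cursor = response.nextCursor;
  }
  return parts;
}

// Stand-in for a real tool call: serves a fixed string in 4-character segments.
const source = "abcdefghij";
const fakeFetch: FetchChunk = async (startCursor) => {
  const end = Math.min(startCursor + 4, source.length);
  return {
    text: source.slice(startCursor, end),
    chunkId: "demo",
    nextCursor: end < source.length ? end : null,
  };
};

fetchAllChunks(fakeFetch).then((parts) => console.log(parts));
```

The `maxChunks` cap bounds the loop even if a server keeps reporting more content, which is the role the default limit of 10 plays in the demo client.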
Verification Suite
```bash
# Execute tests for MCP protocol interactions
npm run test:mcp
# Execute regression tests against mini4k.com data
npm run test:mini4k
# Execute direct client function call verification
npm run test:direct
```
Language Configuration
Bilingual support (English and Chinese) is configurable via an environment variable:
Environment Variable Control
Set the MCP_LANG variable to dictate the operational language:
```bash
# Engage English mode
export MCP_LANG=en
npm run server
# Engage Chinese mode
export MCP_LANG=zh
npm run server
# Windows context
set MCP_LANG=zh
npm run server
```
Environment variables ensure consistent language application across all related service processes.
Default Language Selection Hierarchy
The system defaults language based on this priority:
1. Explicitly set MCP_LANG environment variable.
2. OS locale check (if language starts with "zh", select Chinese).
3. English (final fallback).
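The priority list above can be expressed as a small pure function. This is a sketch rather than the project's actual code: `resolveLanguage` is a hypothetical name, and treating an unrecognized `MCP_LANG` value as unset is an assumption.

```typescript
type Lang = "en" | "zh";

// Inputs are passed in explicitly so the function stays pure and easy to test.
function resolveLanguage(mcpLang: string | undefined, osLocale: string | undefined): Lang {
  // 1. Explicit MCP_LANG wins (assumption: other values are ignored).
  if (mcpLang === "en" || mcpLang === "zh") return mcpLang;
  // 2. OS locale beginning with "zh" selects Chinese.
  if (osLocale !== undefined && osLocale.toLowerCase().startsWith("zh")) return "zh";
  // 3. English is the final fallback.
  return "en";
}

console.log(resolveLanguage("zh", "en_US.UTF-8"));      // zh
console.log(resolveLanguage(undefined, "zh_CN.UTF-8")); // zh
console.log(resolveLanguage(undefined, undefined));     // en
```

In a real process the two arguments would typically come from `process.env.MCP_LANG` and a locale variable such as `LANG`.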
Diagnostic Logging
Per MCP specification, operational logs are suppressed by default to prevent corruption of the JSON-RPC stream. Diagnostics are enabled via request parameters:
Utilizing the debug Flag
Enable verbosity for specific tool invocations:
```json
{
  "url": "https://example.com",
  "debug": true
}
```
Verbose output prefixes utilize standard error (stderr) as follows:
```
[MCP-SERVER] MCP server starting...
[CLIENT] Fetching URL: https://example.com
```
Persistent Debug Log
When debugging is active, all diagnostic streams are mirrored to a file located at:
~/.mult-fetch-mcp-server/debug.log
This log file is accessible through the MCP resources interface:
```typescript
// Fetching the debug log content
const result = await client.readResource({ uri: "file:///logs/debug" });
console.log(result.contents[0].text);

// Command to erase the debug log
const clearResult = await client.readResource({ uri: "file:///logs/clear" });
console.log(clearResult.contents[0].text);
```
Proxy Configuration Strategies
This component accommodates proxy settings via several mechanisms:
1. Explicit Parameter Inclusion
Define the proxy directly within the tool arguments:
```json
{
  "url": "https://example.com",
  "proxy": "http://your-proxy-server:port",
  "debug": true
}
```
2. Environment Variable Interception
The service automatically recognizes and utilizes standard system proxy environment variables:
```bash
# Configure proxy variables
export HTTP_PROXY=http://your-proxy-server:port
export HTTPS_PROXY=http://your-proxy-server:port
# Start the service
npm run server
```
3. Operating System Setting Discovery
The service attempts to dynamically query system-level proxy configurations:
- Windows: Queries environment variables using system commands.
- macOS/Linux: Queries environment variables using system commands.
4. Proxy Troubleshooting Guide
If proxy functionality is inconsistent:
- Activate debug: true to inspect detailed proxy detection logs.
- Manually specify the proxy via the proxy parameter.
- Verify the proxy URL adheres to the http://host:port or https://host:port structure.
- If browser emulation is required, ensure useBrowser: true is set.
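The proxy URL format can be checked before sending a request. This sketch relies on the standard WHATWG `URL` class available in Node.js; `isValidProxyUrl` is a hypothetical helper and does not reflect how the service validates its input.

```typescript
// Accept only http:// or https:// URLs that carry a host name.
function isValidProxyUrl(value: string): boolean {
  try {
    const parsed = new URL(value);
    const schemeOk = parsed.protocol === "http:" || parsed.protocol === "https:";
    return schemeOk && parsed.hostname.length > 0;
  } catch {
    // new URL() throws on strings it cannot parse, such as a bare host:port.
    return false;
  }
}

console.log(isValidProxyUrl("http://127.0.0.1:8080"));
console.log(isValidProxyUrl("127.0.0.1:8080"));
console.log(isValidProxyUrl("socks5://127.0.0.1:1080"));
```

The check does not require an explicit port, because the `URL` class reports an empty `port` when the default for the scheme (80 or 443) is given.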
5. Browser Mode Proxy Behavior
When employing browser simulation (useBrowser: true), proxy resolution follows this precedence:
- Explicitly supplied proxy in the request parameters.
- Detected system-wide proxy settings.
- No proxy utilized if previous checks fail.
Browser mode is critical for resources protected by anti-scraping mechanisms that require dynamic rendering.
Parameter Governance
The system manages configuration inputs as follows:
- debug: Governed on a per-request basis via call arguments.
- MCP_LANG: Inferred from environment variables, governing the global server language context.
Operational Usage
Instantiating a Client (SDK Example)
```typescript
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
import path from 'path';
import { fileURLToPath } from 'url';

// Determine current file location
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Setup transport layer via child process communication
const transport = new StdioClientTransport({
  command: 'node',
  args: [path.resolve(__dirname, 'dist/index.js')],
  stderr: 'inherit',
  env: {
    ...process.env // Propagate existing environment variables
  }
});

// Initialize the client interface
const client = new Client({
  name: "example-client-app",
  version: "1.0.0"
});

// Establish connection
await client.connect(transport);

// Invoke a tool call
const result = await client.callTool({
  name: 'fetch_html',
  arguments: {
    url: 'https://example.com',
    debug: true // Local debug override
  }
});

if (result.isError) {
  console.error('Retrieval Failure:', result.content[0].text);
} else {
  console.log('Retrieval Successful!');
  console.log('Content Snippet:', result.content[0].text.substring(0, 500));
}
```
Available Functions (Tools)
- fetch_html: Retrieves the complete HTML source.
- fetch_json: Retrieves data structured as JSON.
- fetch_txt: Retrieves content as raw, unformatted text.
- fetch_markdown: Retrieves content formatted using Markdown syntax.
- fetch_plaintext: Retrieves content derived from HTML, stripped of all tags.
Resource Management Capabilities
The service supports MCP's resource listing and reading methods, though no internal resources are pre-registered. This framework is designed for accessing documentation or component files.
Resource Interaction Example
```typescript
// Query for accessible resources
const resourcesResult = await client.listResources({});
console.log('Discovered resources:', resourcesResult);

// Note: This currently yields empty lists until internal resources are defined.
```
Prompt Template System
The server offers pre-compiled instruction sets for common operations:
- fetch-website: General web data retrieval, adjustable for format and browser utilization.
- extract-content: Focused data capture using CSS selectors and output type specification.
- debug-fetch: Diagnostic template for analyzing fetch failures and proposing remedies.
Prompt Interaction
- Use prompts/list to enumerate available templates.
- Use prompts/get to retrieve the content of a specific template.
```typescript
// List available prompt definitions
const promptsResult = await client.listPrompts({});
console.log('Available prompts:', promptsResult);

// Example: Generating a prompt to fetch website content
const fetchPrompt = await client.getPrompt({
  name: "fetch-website",
  arguments: {
    url: "https://example.com",
    format: "html",
    useBrowser: "false"
  }
});
console.log('Fetch website prompt details:', fetchPrompt);

// Example: Diagnostic prompt generation
const debugPrompt = await client.getPrompt({
  name: "debug-fetch",
  arguments: {
    url: "https://example.com",
    error: "Connection timeout"
  }
});
console.log('Debug fetch prompt details:', debugPrompt);
```
Universal Parameter Set
All retrieval operations accept the following parameters:
Core Parameters
- url: The Uniform Resource Locator to target (Mandatory).
- headers: Custom HTTP request metadata (Optional, defaults to empty object).
- proxy: Proxy server address (e.g., http://host:port) (Optional).
Network Configuration
- timeout: Maximum duration for the request in milliseconds (Optional, default 30000).
- maxRedirects: Limit on HTTP redirection hops (Optional, default 10).
- noDelay: Boolean flag to disable introduced request backoff periods (Optional, default false).
- useSystemProxy: Boolean flag to permit utilization of OS proxy configurations (Optional, default true).
Segment Management (For Large Content)
- enableContentSplitting: Activates automatic segmentation of large documents (Optional, default true).
- contentSizeLimit: Maximum byte size per segment prior to splitting (Optional, default 50000 bytes).
- startCursor: Byte index specifying the retrieval starting point (Optional, default 0).
- chunkId: Identifier assigned to a segmented content set, used when requesting subsequent portions.
These parameters facilitate the controlled acquisition of expansive web documents, ensuring data fits within downstream AI context boundaries via precise byte-level chunking.
Segmented responses furnish metadata enabling subsequent calls using chunkId and startCursor for uninterrupted content flow.
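To make the roles of `startCursor` and `contentSizeLimit` concrete, here is a minimal sketch of byte-level slicing over UTF-8 text. It is not taken from the project's `ChunkManager`; the function name `sliceChunk` and the returned field names are assumptions chosen for illustration.

```typescript
// Hypothetical result shape for one segment.
interface Chunk {
  text: string;
  startCursor: number;
  nextCursor: number | null; // null once the final segment has been returned
  totalBytes: number;
}

function sliceChunk(content: string, startCursor = 0, contentSizeLimit = 50000): Chunk {
  const bytes = new TextEncoder().encode(content);
  let end = Math.min(startCursor + contentSizeLimit, bytes.length);
  // Back off while the cut would land on a UTF-8 continuation byte (10xxxxxx),
  // so that no multi-byte character is split across two segments.
  while (end > startCursor && end < bytes.length && (bytes[end] & 0xc0) === 0x80) {
    end--;
  }
  return {
    text: new TextDecoder().decode(bytes.subarray(startCursor, end)),
    startCursor,
    nextCursor: end < bytes.length ? end : null,
    totalBytes: bytes.length,
  };
}

const first = sliceChunk("héllo wörld", 0, 6);
const second = sliceChunk("héllo wörld", first.nextCursor ?? 0, 6);
console.log(first.text, "|", second.text);
```

Feeding each segment's `nextCursor` back in as the next `startCursor` walks the document without gaps or overlap. The sketch does not guard against a limit smaller than one encoded character, which would make no progress.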
Mode Selection
- useBrowser: Forces execution within the headless browser environment (Optional, default false).
- useNodeFetch: Forces standard Node.js HTTP client usage (Optional, default false; mutually exclusive with useBrowser).
- autoDetectMode: If standard fetch fails (e.g., 403 response), automatically switch to browser mode (Optional, default true). Set to false for strict mode adherence.
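The interaction between these three flags can be sketched as two small decisions: which mode to start in, and whether a failed request should be retried in the browser. The function names are hypothetical, and two points are assumptions rather than documented behaviour: that only a 403 status triggers the switch here, and that `useNodeFetch: true` rules out the fallback.

```typescript
type Mode = "node" | "browser";

interface ModeOptions {
  useBrowser?: boolean;
  useNodeFetch?: boolean;
  autoDetectMode?: boolean;
}

// Pick the starting mode; the two forcing flags cannot be combined.
function initialMode(options: ModeOptions): Mode {
  if (options.useBrowser && options.useNodeFetch) {
    throw new Error("useBrowser and useNodeFetch are mutually exclusive");
  }
  return options.useBrowser ? "browser" : "node";
}

// Decide whether a failed Node.js request should be retried in browser mode.
function shouldFallBackToBrowser(options: ModeOptions, mode: Mode, status: number): boolean {
  const autoDetect = options.autoDetectMode ?? true; // documented default
  // Assumption: a 403 is the only trigger considered in this sketch.
  return mode === "node" && !options.useNodeFetch && autoDetect && status === 403;
}

const options: ModeOptions = {};
const mode = initialMode(options);
console.log(mode, shouldFallBackToBrowser(options, mode, 403));
```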
Browser-Specific Arguments
- waitForSelector: CSS selector the browser must confirm visibility for before proceeding (Optional, default 'body').
- waitForTimeout: Maximum wait time for selector visibility in milliseconds (Optional, default 5000).
- scrollToBottom: Boolean flag to trigger a full page scroll operation (Optional, default false).
- saveCookies: Boolean flag to retain session cookies across subsequent browser interactions (Optional, default true).
- closeBrowser: Boolean flag to immediately terminate the browser instance post-operation (Optional, default false).
Content Refinement Parameters
- extractContent: Boolean to invoke the core content extraction algorithm (Optional, default false).
- includeMetadata: Boolean to include structural metadata alongside extracted text (Optional, default false; requires extractContent to be true).
- fallbackToOriginal: Boolean to revert to raw content if intelligent extraction fails (Optional, default true; requires extractContent to be true).
Diagnostic Parameter
- debug: Activates verbose logging output (Optional, default false).
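The defaults documented across the parameter groups above can be collected in one place and merged with a caller's arguments. This is a sketch of that idea only; `withDefaults` and the type names are hypothetical, and the service may apply its defaults differently.

```typescript
// Default values as listed in the parameter reference above.
const DEFAULTS = {
  timeout: 30000,
  maxRedirects: 10,
  noDelay: false,
  useSystemProxy: true,
  enableContentSplitting: true,
  contentSizeLimit: 50000,
  startCursor: 0,
  useBrowser: false,
  useNodeFetch: false,
  autoDetectMode: true,
  waitForSelector: "body",
  waitForTimeout: 5000,
  scrollToBottom: false,
  saveCookies: true,
  closeBrowser: false,
  extractContent: false,
  includeMetadata: false,
  fallbackToOriginal: true,
  debug: false,
};

type Defaults = typeof DEFAULTS;
type FetchParams = {
  url: string;
  headers?: Record<string, string>;
  proxy?: string;
  chunkId?: string;
} & Partial<Defaults>;

// Later spreads win, so caller-supplied values override the defaults.
// Note: a key passed explicitly as undefined would also override its default.
function withDefaults(params: FetchParams): FetchParams & Defaults {
  if (!params.url) throw new Error("url is required");
  return { headers: {}, ...DEFAULTS, ...params };
}

const resolved = withDefaults({ url: "https://example.com", timeout: 1000 });
console.log(resolved.timeout, resolved.maxRedirects, resolved.waitForSelector);
```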
Content Refinement Feature Detail
Employ this feature to isolate essential article text, discarding peripheral elements like banners and sidebars:
```json
{
  "url": "https://example.com/article",
  "extractContent": true,
  "includeMetadata": true
}
```
Successful extraction yields metadata including:
- Title
- Author/Byline
- Source Title
- Summary Excerpt
- Content Byte Length
- Readability Confirmation Flag (isReaderable)
Specialized Operations
To command the closure of the active browser session without executing any data fetching task:
```json
{
  "url": "about:blank",
  "closeBrowser": true
}
```
Proxy Precedence
Proxy configuration resolution order:
1. Value provided in the command line arguments (if applicable).
2. Value specified in the request's proxy parameter.
3. Environment variables (if useSystemProxy remains true).
4. Git configuration settings (if useSystemProxy remains true).
Setting the proxy parameter overrides the useSystemProxy flag, forcing direct parameter use.
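The resolution order above maps onto a short chain of checks. This sketch takes every candidate value as an input instead of reading the environment or Git configuration itself; `resolveProxy` and the field names are hypothetical.

```typescript
interface ProxySources {
  cliProxy?: string;      // 1. command line argument, if applicable
  requestProxy?: string;  // 2. the request's proxy parameter
  envProxy?: string;      // 3. e.g. HTTP_PROXY / HTTPS_PROXY
  gitProxy?: string;      // 4. value found in Git configuration
  useSystemProxy?: boolean;
}

function resolveProxy(sources: ProxySources): string | undefined {
  if (sources.cliProxy) return sources.cliProxy;
  // An explicit request proxy is used regardless of useSystemProxy.
  if (sources.requestProxy) return sources.requestProxy;
  if (sources.useSystemProxy ?? true) {
    return sources.envProxy || sources.gitProxy || undefined;
  }
  return undefined;
}

console.log(resolveProxy({ requestProxy: "http://a:1", envProxy: "http://b:2" }));
console.log(resolveProxy({ envProxy: "http://b:2", useSystemProxy: false }));
```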
Diagnostic Output Details
When debug: true is active, logs directed to stderr are prefixed to identify the generating subsystem:
- [MCP-SERVER]: Service framework logs.
- [NODE-FETCH]: Node.js native transport logs.
- [BROWSER-FETCH]: Headless browser transport logs.
- [CLIENT]: Consumer application interaction logs.
- [TOOLS]: Tool logic execution logs.
- [FETCHER]: High-level retrieval interface logs.
- [CONTENT]: Data structure and content handling logs.
- [CONTENT-PROCESSOR]: HTML parsing and text conversion logs.
- [CONTENT-SIZE]: Data segmentation governance logs.
- [CHUNK-MANAGER]: Segment ordering and retrieval logs.
- [ERROR-HANDLER]: Exception reporting system logs.
- [BROWSER-MANAGER]: Browser lifecycle control logs.
- [CONTENT-EXTRACTOR]: Intelligent content isolation logs.
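Since every debug line starts with a bracketed subsystem tag, captured stderr output can be filtered by component. The following is a minimal sketch assuming each line has the form `[PREFIX] message`, as in the sample lines shown earlier; `parseLogLine` is a hypothetical helper.

```typescript
interface LogEntry {
  subsystem: string;
  message: string;
}

// Matches an upper-case, hyphenated tag in square brackets at the start of a line.
const LOG_LINE = /^\[([A-Z-]+)\]\s*(.*)$/;

function parseLogLine(line: string): LogEntry | null {
  const match = LOG_LINE.exec(line);
  return match ? { subsystem: match[1], message: match[2] } : null;
}

const captured = [
  "[MCP-SERVER] MCP server starting...",
  "[CLIENT] Fetching URL: https://example.com",
  "unprefixed line",
];

const entries = captured
  .map(parseLogLine)
  .filter((entry): entry is LogEntry => entry !== null);

// Keep only the lines emitted by one subsystem.
const clientOnly = entries.filter((entry) => entry.subsystem === "CLIENT");
console.log(clientOnly);
```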
Licensing
Licensed under MIT
Updated by lmcc-dev

