unified-web-data-retriever-mcp-service

A Model Context Protocol (MCP) service that fetches remote web content in multiple formats, including structured data (JSON) and hypertext (HTML), with content extraction logic and built-in bilingual (English/Chinese) support.

Author

lmcc-dev

MIT License

Quick Info

GitHub Stars: 13
NPM Weekly Downloads: 0
Tools: 1
Last Updated: 2026-02-19

Tags

scraping, fetch, apis, scraping techniques, requests, lmcc, intelligent scraping

This service implements the Model Context Protocol (MCP), giving AI agents a standard interface for retrieving data from external web sources.

Component Architecture Overview

    fetch-mcp/
    ├── src/                                # Source code root
    │   ├── lib/                            # Core library modules
    │   │   ├── fetchers/                   # Retrieval implementations
    │   │   │   ├── browser/                # Headless browser utility set
    │   │   │   │   ├── BrowserFetcher.ts   # Browser interaction logic
    │   │   │   │   ├── BrowserInstance.ts  # Browser lifecycle management
    │   │   │   │   └── PageOperations.ts   # In-page actions
    │   │   │   ├── node/                   # Standard HTTP fetching (Node.js based)
    │   │   │   └── common/                 # Shared transport utilities
    │   │   ├── utils/                      # Auxiliary modules
    │   │   │   ├── ChunkManager.ts         # Data segmentation utility
    │   │   │   ├── ContentProcessor.ts     # Transformation from HTML to clean text
    │   │   │   ├── ContentExtractor.ts     # Sophisticated data capture logic
    │   │   │   ├── ContentSizeManager.ts   # Mechanisms for size governance
    │   │   │   └── ErrorHandler.ts         # Exception management routines
    │   │   ├── server/                     # Server-side framework components
    │   │   │   ├── index.ts                # Server bootstrap
    │   │   │   ├── browser.ts              # Browser orchestration
    │   │   │   ├── fetcher.ts              # Unified fetching API
    │   │   │   ├── tools.ts                # Tool registration handlers
    │   │   │   ├── resources.ts            # Asset management interfaces
    │   │   │   ├── prompts.ts              # Predefined instruction templates
    │   │   │   └── types.ts                # Server-side type definitions
    │   │   ├── i18n/                       # Language localization files
    │   │   └── types.ts                    # Shared data structures
    │   ├── client.ts                       # MCP client interface code
    │   └── mcp-server.ts                   # Primary server execution file
    ├── index.ts                            # Application entry point
    ├── tests/                              # Verification scripts
    └── dist/                               # Compiled output directory

MCP Protocol Implementation Details

The Model Context Protocol (MCP) defines two primary transport mechanisms:

  1. Standard Input/Output (stdio): The client launches the MCP service as a child process and exchanges messages with it over the standard streams (stdin/stdout).
  2. Server-Sent Events (SSE): The client and the service exchange messages over HTTP, with the service streaming its messages to the client as server-sent events.

This implementation uses only the stdio transport.
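To make the stdio exchange concrete, here is a minimal sketch of how JSON-RPC messages can be framed as newline-delimited JSON, which is how the MCP stdio transport separates messages. The names `encodeMessage` and `LineDecoder` are hypothetical helpers written for this illustration; they are not part of this service or of the MCP SDK, which handles framing internally.

```typescript
// Hypothetical helpers illustrating newline-delimited JSON-RPC framing.
// They are NOT part of this service or of the MCP SDK.

interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number;
  method: string;
  params?: Record<string, unknown>;
}

// Serialize one message as a single line terminated by "\n".
function encodeMessage(message: JsonRpcRequest): string {
  return JSON.stringify(message) + "\n";
}

// Accumulates raw stream chunks and returns every complete message seen so far.
class LineDecoder {
  private buffer = "";

  push(chunk: string): JsonRpcRequest[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or empty); keep it buffered.
    this.buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim().length > 0)
      .map((line) => JSON.parse(line) as JsonRpcRequest);
  }
}

const wire = encodeMessage({
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: { name: "fetch_html", arguments: { url: "https://example.com" } },
});

// Simulate the stream arriving in two arbitrary pieces.
const decoder = new LineDecoder();
const firstBatch = decoder.push(wire.slice(0, 20));
const secondBatch = decoder.push(wire.slice(20));
console.log(firstBatch.length, secondBatch.length, secondBatch[0].method);
```

Because stdout carries these frames, anything else written to stdout would break parsing on the client side, which is why the service sends diagnostics to stderr instead.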

Key Capabilities

  • Adherence to the official MCP SDK specifications.
  • Robust support for Stdio communication.
  • Versatile web acquisition methods (handling HTML, JSON, raw text, Markdown, and clean text transformation).
  • Adaptive Mode Selection: Intelligent toggling between lightweight requests and full browser simulation.
  • Context Limitation Mitigation: Automatic segmentation of overly large retrieved documents into manageable segments for constrained AI context windows.
  • Segmented Retrieval: Capacity to request arbitrary segments of previously segmented content, ensuring contextual integrity.
  • Detailed diagnostics streamed to standard error (stderr).
  • Full bilingual support (English and Mandarin Chinese).
  • Modular codebase structure for simplified upkeep and expansion.
  • Advanced Content Sifting: Uses Mozilla's Readability library to isolate a page's primary content and strip extraneous elements such as advertisements and navigation.
  • Meta-data Capture: Extraction of key webpage attributes, including document title, authorship, publishing timestamp, and source site identification.
  • Substantial Content Validation: Automated checking to filter out non-substantive pages such as login portals or error screens.
  • Browser Feature Augmentation: Capabilities covering page manipulation (scrolling), session cookie handling, explicit element waiting, and other sophisticated browser interactions.

Deployment

Installing via Smithery

To install Unified Web Data Retriever MCP Service into Claude Desktop automatically using Smithery:

    npx -y @smithery/cli install @lmcc-dev/mult-fetch-mcp-server --client claude

Local Setup

    pnpm install

Global Installation

    pnpm add -g @lmcc-dev/mult-fetch-mcp-server

Alternatively, run it directly with npx, without a persistent installation:

    npx @lmcc-dev/mult-fetch-mcp-server

Integration with Claude Desktop

To enable this service in the Claude desktop application, edit its configuration file:

Configuration File Location

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%/Claude/claude_desktop_config.json

Configuration Blueprints

Method 1: Using npx

This approach avoids absolute paths and works well with global installations or direct npx invocation:

    {
      "mcpServers": {
        "unified-web-data-retriever-mcp-service": {
          "command": "npx",
          "args": ["@lmcc-dev/mult-fetch-mcp-server"],
          "env": {
            "MCP_LANG": "en"
          }
        }
      }
    }

Method 2: Specifying Absolute Path

Use this if targeting a specific installation directory:

    {
      "mcpServers": {
        "unified-web-data-retriever-mcp-service": {
          "command": "path-to/bin/node",
          "args": ["path-to/@lmcc-dev/mult-fetch-mcp-server/dist/index.js"],
          "env": {
            "MCP_LANG": "en"
          }
        }
      }
    }

In both methods, MCP_LANG sets the language and accepts "zh" or "en". Substitute the placeholder paths with your actual Node.js executable location and the project's root directory.

Usage Demonstration

Upon successful configuration and restarting Claude, the following tools become available for interaction:

  • fetch_html: Retrieves the raw hypertext structure.
  • fetch_json: Obtains data formatted as JSON.
  • fetch_txt: Fetches raw, unformatted text.
  • fetch_markdown: Fetches content represented in Markdown syntax.
  • fetch_plaintext: Retrieves HTML content stripped entirely of markup tags.

Compilation

    pnpm run build

Executing the Service

    pnpm run server

    # Alternatively
    node dist/index.js

    # If globally installed
    @lmcc-dev/mult-fetch-mcp-server

    # Or via npx
    npx @lmcc-dev/mult-fetch-mcp-server

Development Client Utilities

Note: The client.js utility is intended strictly for prototyping and validation. When integrated with Claude or similar systems, the AI orchestrates the service, handling chunking logic transparently.

Command Line Interface

    pnpm run client

    # Example invocation
    pnpm run client fetch_html '{"url": "https://example.com", "debug": true}'

Demo Client Chunk Control Arguments

Parameters available when testing segmentation using the CLI client:

  • --all-chunks: Flag to sequentially pull every generated segment (testing utility only).
  • --max-chunks: Cap on the total number of segments retrieved (default limit set to 10).

Live Output Showcase

The client.js script supports real-time data streaming:

    node dist/src/client.js fetch_html '{"url":"https://example.com", "startCursor": 0, "contentSizeLimit": 500}' --all-chunks --debug

This command demonstrates sequential fetching and immediate display of data segments, highlighting real-time large content processing.

Verification Suite

    # Execute tests for MCP protocol interactions
    npm run test:mcp

    # Execute regression tests against mini4k.com data
    npm run test:mini4k

    # Execute direct client function call verification
    npm run test:direct

Language Configuration

Bilingual support (English and Chinese) is configurable via an environment variable:

Environment Variable Control

Set the MCP_LANG variable to select the operational language:

    # Engage English mode
    export MCP_LANG=en
    npm run server

    # Engage Chinese mode
    export MCP_LANG=zh
    npm run server

    # Windows context
    set MCP_LANG=zh
    npm run server

Environment variables ensure consistent language application across all related service processes.

Default Language Selection Hierarchy

The system selects the default language in this priority order:

  1. Explicitly set MCP_LANG environment variable.
  2. OS locale check (if the language starts with "zh", select Chinese).
  3. English (final fallback).
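The priority order above can be sketched as a small pure function. This is an illustration only: `resolveLanguage` is a hypothetical name, and the handling of unrecognized MCP_LANG values (falling through to the locale check) is an assumption, not documented behavior.

```typescript
type Lang = "en" | "zh";

// Hypothetical helper mirroring the documented priority order.
// Assumption: an unrecognized MCP_LANG value is ignored and the next rule applies.
function resolveLanguage(
  envLang: string | undefined,
  osLocale: string | undefined,
): Lang {
  // 1. An explicit MCP_LANG value wins.
  if (envLang === "en" || envLang === "zh") return envLang;
  // 2. Otherwise inspect the OS locale, for example "zh_CN.UTF-8" or "zh-TW".
  if (osLocale !== undefined && osLocale.toLowerCase().startsWith("zh")) {
    return "zh";
  }
  // 3. English is the final fallback.
  return "en";
}

console.log(resolveLanguage("en", "zh_CN.UTF-8"));      // explicit setting wins
console.log(resolveLanguage(undefined, "zh_CN.UTF-8")); // locale decides
console.log(resolveLanguage(undefined, "fr_FR.UTF-8")); // fallback
```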

Diagnostic Logging

Per MCP specification, operational logs are suppressed by default to prevent corruption of the JSON-RPC stream. Diagnostics are enabled via request parameters:

Utilizing the debug Flag

Enable verbosity for specific tool invocations:

{ "url": "https://example.com", "debug": true }

Debug messages are written to standard error (stderr), each prefixed with the subsystem that produced it:

    [MCP-SERVER] MCP server starting...
    [CLIENT] Fetching URL: https://example.com
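A minimal sketch of this logging pattern follows, assuming a per-request `debug` flag and a fixed subsystem prefix. `createLogger` and its sink parameter are hypothetical and do not reflect the service's actual internals; the default sink uses `console.error`, which writes to stderr in Node.js and therefore leaves the JSON-RPC stream on stdout untouched.

```typescript
type Sink = (line: string) => void;

// Hypothetical logger factory: emits nothing unless debug is true,
// and never writes to stdout.
function createLogger(
  prefix: string,
  sink: Sink = (line) => console.error(line),
) {
  return (message: string, debug: boolean): void => {
    if (!debug) return; // silent unless the request asked for diagnostics
    sink(`[${prefix}] ${message}`);
  };
}

// Capture output in memory so the behaviour is easy to inspect.
const captured: string[] = [];
const log = createLogger("MCP-SERVER", (line) => captured.push(line));

log("MCP server starting...", true);
log("this line is dropped", false);
console.log(captured);
```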

Persistent Debug Log

When debugging is active, all diagnostic streams are mirrored to a file located at:

~/.mult-fetch-mcp-server/debug.log

This log file is accessible through the MCP resources interface:

    // Fetching the debug log content
    const result = await client.readResource({ uri: "file:///logs/debug" });
    console.log(result.contents[0].text);

    // Command to erase the debug log
    const clearResult = await client.readResource({ uri: "file:///logs/clear" });
    console.log(clearResult.contents[0].text);

Proxy Configuration Strategies

This component accommodates proxy settings via several mechanisms:

1. Explicit Parameter Inclusion

Define the proxy directly within the tool arguments:

    {
      "url": "https://example.com",
      "proxy": "http://your-proxy-server:port",
      "debug": true
    }

2. Environment Variable Interception

The service automatically recognizes and utilizes standard system proxy environment variables:

    # Configure proxy variables
    export HTTP_PROXY=http://your-proxy-server:port
    export HTTPS_PROXY=http://your-proxy-server:port

    # Start the service
    npm run server

3. Operating System Setting Discovery

The service attempts to dynamically query system-level proxy configurations:

  • Windows: Queries environment variables using system commands.
  • macOS/Linux: Queries environment variables using system commands.

4. Proxy Troubleshooting Guide

If proxy functionality is inconsistent:

  1. Activate debug: true to inspect detailed proxy detection logs.
  2. Manually specify the proxy via the proxy parameter.
  3. Verify the proxy URL adheres to the http://host:port or https://host:port structure.
  4. If browser emulation is required, ensure useBrowser: true is set.
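The format check in step 3 can be automated before a request is sent. The following is a rough sketch under stated assumptions: `isValidProxyUrl` is a hypothetical helper, it requires an explicit port as the documented structure suggests, and it deliberately ignores credentials and IPv6 literals, which the service may or may not accept.

```typescript
// Hypothetical validator for the documented http://host:port or
// https://host:port structure. Credentials and IPv6 hosts are not handled.
function isValidProxyUrl(value: string): boolean {
  const match = /^(https?):\/\/([^\s\/:@]+):(\d{1,5})\/?$/.exec(value);
  if (match === null) return false;
  const port = Number(match[3]);
  return port >= 1 && port <= 65535;
}

console.log(isValidProxyUrl("http://127.0.0.1:7890"));         // well-formed
console.log(isValidProxyUrl("socks5://127.0.0.1:1080"));       // wrong scheme
console.log(isValidProxyUrl("http://your-proxy-server:port")); // placeholder left in
```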

5. Browser Mode Proxy Behavior

When employing browser simulation (useBrowser: true), proxy resolution follows this precedence:

  1. Explicitly supplied proxy in the request parameters.
  2. Detected system-wide proxy settings.
  3. No proxy utilized if previous checks fail.

Browser mode is critical for resources protected by anti-scraping mechanisms that require dynamic rendering.

Parameter Governance

The system manages configuration inputs as follows:

  • debug: Governed on a per-request basis via call arguments.
  • MCP_LANG: Inferred from environment variables, governing the global server language context.

Operational Usage

Instantiating a Client (SDK Example)

    import { Client } from '@modelcontextprotocol/sdk/client/index.js';
    import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
    import path from 'path';
    import { fileURLToPath } from 'url';

    // Determine current file location
    const __filename = fileURLToPath(import.meta.url);
    const __dirname = path.dirname(__filename);

    // Setup transport layer via child process communication
    const transport = new StdioClientTransport({
      command: 'node',
      args: [path.resolve(__dirname, 'dist/index.js')],
      stderr: 'inherit',
      env: {
        ...process.env // Propagate existing environment variables
      }
    });

    // Initialize the client interface
    const client = new Client({
      name: "example-client-app",
      version: "1.0.0"
    });

    // Establish connection
    await client.connect(transport);

    // Invoke a tool call
    const result = await client.callTool({
      name: 'fetch_html',
      arguments: {
        url: 'https://example.com',
        debug: true // Local debug override
      }
    });

    if (result.isError) {
      console.error('Retrieval Failure:', result.content[0].text);
    } else {
      console.log('Retrieval Successful!');
      console.log('Content Snippet:', result.content[0].text.substring(0, 500));
    }

Available Functions (Tools)

  • fetch_html: Retrieves the complete HTML source.
  • fetch_json: Retrieves data structured as JSON.
  • fetch_txt: Retrieves content as raw, unformatted text.
  • fetch_markdown: Retrieves content formatted using Markdown syntax.
  • fetch_plaintext: Retrieves content derived from HTML, stripped of all tags.

Resource Management Capabilities

The service supports MCP's resource listing and reading methods, though no internal resources are pre-registered. This framework is designed for accessing documentation or component files.

Resource Interaction Example

    // Query for accessible resources
    const resourcesResult = await client.listResources({});
    console.log('Discovered resources:', resourcesResult);

    // Note: This currently yields empty lists until internal resources are defined.

Prompt Template System

The server offers pre-compiled instruction sets for common operations:

  • fetch-website: General web data retrieval, adjustable for format and browser utilization.
  • extract-content: Focused data capture using CSS selectors and output type specification.
  • debug-fetch: Diagnostic template for analyzing fetch failures and proposing remedies.

Prompt Interaction

  1. Use prompts/list to enumerate available templates.
  2. Use prompts/get to retrieve the content of a specific template.

    // List available prompt definitions
    const promptsResult = await client.listPrompts({});
    console.log('Available prompts:', promptsResult);

    // Example: Generating a prompt to fetch website content
    const fetchPrompt = await client.getPrompt({
      name: "fetch-website",
      arguments: {
        url: "https://example.com",
        format: "html",
        useBrowser: "false"
      }
    });
    console.log('Fetch website prompt details:', fetchPrompt);

    // Example: Diagnostic prompt generation
    const debugPrompt = await client.getPrompt({
      name: "debug-fetch",
      arguments: {
        url: "https://example.com",
        error: "Connection timeout"
      }
    });
    console.log('Debug fetch prompt details:', debugPrompt);

Universal Parameter Set

All retrieval operations accept the following parameters:

Core Parameters

  • url: The Uniform Resource Locator to target (Mandatory).
  • headers: Custom HTTP request metadata (Optional, defaults to empty object).
  • proxy: Proxy server address (e.g., http://host:port) (Optional).

Network Configuration

  • timeout: Maximum duration for the request in milliseconds (Optional, default 30000).
  • maxRedirects: Limit on HTTP redirection hops (Optional, default 10).
  • noDelay: Boolean flag to disable introduced request backoff periods (Optional, default false).
  • useSystemProxy: Boolean flag to permit utilization of OS proxy configurations (Optional, default true).

Segment Management (For Large Content)

  • enableContentSplitting: Activates automatic segmentation of large documents (Optional, default true).
  • contentSizeLimit: Maximum byte size per segment prior to splitting (Optional, default 50000 bytes).
  • startCursor: Byte index specifying the retrieval starting point (Optional, default 0).
  • chunkId: Identifier assigned to a segmented content set, used when requesting subsequent portions.

These parameters allow large web documents to be retrieved in controlled pieces, using byte-level chunking to keep each piece within the context limits of the consuming AI model.

Segmented responses include metadata that lets subsequent calls pass chunkId and startCursor to continue where the previous segment ended.
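To illustrate the cursor arithmetic, the sketch below computes which byte ranges a client would request for a document of a given size. It is a simplified model: `planChunks` and `ChunkRange` are hypothetical names, and the real service may adjust boundaries (for example to avoid splitting multi-byte characters), which this sketch does not attempt.

```typescript
interface ChunkRange {
  startCursor: number; // inclusive byte offset to pass as startCursor
  endCursor: number;   // exclusive byte offset where this segment ends
}

// Hypothetical planner: splits totalBytes into ranges of at most
// contentSizeLimit bytes, using the documented default of 50000.
function planChunks(totalBytes: number, contentSizeLimit = 50000): ChunkRange[] {
  if (contentSizeLimit <= 0) {
    throw new RangeError("contentSizeLimit must be positive");
  }
  const ranges: ChunkRange[] = [];
  for (let start = 0; start < totalBytes; start += contentSizeLimit) {
    ranges.push({
      startCursor: start,
      endCursor: Math.min(start + contentSizeLimit, totalBytes),
    });
  }
  return ranges;
}

// A 120000-byte document with the default limit needs three requests.
const ranges = planChunks(120000);
console.log(ranges.map((r) => r.startCursor)); // [ 0, 50000, 100000 ]
```

In practice a client would send the first request with startCursor 0, then repeat the call with the returned chunkId and the next startCursor until the final segment has been received.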

Mode Selection

  • useBrowser: Forces execution within the headless browser environment (Optional, default false).
  • useNodeFetch: Forces standard Node.js HTTP client usage (Optional, default false; mutually exclusive with useBrowser).
  • autoDetectMode: If standard fetch fails (e.g., 403 response), automatically switch to browser mode (Optional, default true). Set to false for strict mode adherence.
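The interplay of these three flags can be sketched as two small decision functions. This is an illustrative model, not the service's implementation: the function names are hypothetical, rejecting conflicting flags with an error is an assumption, and only the 403 status mentioned above is treated as a fallback trigger.

```typescript
type Mode = "node" | "browser";

interface ModeOptions {
  useBrowser?: boolean;     // default false
  useNodeFetch?: boolean;   // default false
  autoDetectMode?: boolean; // default true
}

// Pick the mode for the first attempt.
// Assumption: conflicting flags are rejected rather than silently resolved.
function initialMode(options: ModeOptions): Mode {
  if (options.useBrowser && options.useNodeFetch) {
    throw new Error("useBrowser and useNodeFetch are mutually exclusive");
  }
  return options.useBrowser ? "browser" : "node";
}

// Decide whether a failed standard fetch should be retried in browser mode.
// Assumption: only HTTP 403 triggers the fallback in this sketch.
function shouldRetryInBrowser(
  options: ModeOptions,
  firstMode: Mode,
  httpStatus: number,
): boolean {
  const autoDetect = options.autoDetectMode ?? true;
  return autoDetect && firstMode === "node" && httpStatus === 403;
}

const mode = initialMode({});
console.log(mode, shouldRetryInBrowser({}, mode, 403));
console.log(shouldRetryInBrowser({ autoDetectMode: false }, mode, 403));
```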

Browser-Specific Arguments

  • waitForSelector: CSS selector the browser must confirm visibility for before proceeding (Optional, default 'body').
  • waitForTimeout: Maximum wait time for selector visibility in milliseconds (Optional, default 5000).
  • scrollToBottom: Boolean flag to trigger a full page scroll operation (Optional, default false).
  • saveCookies: Boolean flag to retain session cookies across subsequent browser interactions (Optional, default true).
  • closeBrowser: Boolean flag to immediately terminate the browser instance post-operation (Optional, default false).

Content Refinement Parameters

  • extractContent: Boolean to invoke the core content extraction algorithm (Optional, default false).
  • includeMetadata: Boolean to include structural metadata alongside extracted text (Optional, default false; requires extractContent to be true).
  • fallbackToOriginal: Boolean to revert to raw content if intelligent extraction fails (Optional, default true; requires extractContent to be true).

Diagnostic Parameter

  • debug: Activates verbose logging output (Optional, default false).
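For client authors, the documented defaults can be gathered into one typed helper. The sketch below covers only a subset of the parameters listed above; `FetchParams`, `DEFAULTS`, and `withDefaults` are hypothetical names, and the service applies its own defaults regardless of whether a client does this.

```typescript
// Subset of the documented parameters; names match the lists above.
interface FetchParams {
  url: string;
  headers?: Record<string, string>;
  timeout?: number;
  maxRedirects?: number;
  noDelay?: boolean;
  useSystemProxy?: boolean;
  enableContentSplitting?: boolean;
  contentSizeLimit?: number;
  startCursor?: number;
  extractContent?: boolean;
  includeMetadata?: boolean;
  debug?: boolean;
}

// Default values as stated in this document.
const DEFAULTS = {
  headers: {} as Record<string, string>,
  timeout: 30000,
  maxRedirects: 10,
  noDelay: false,
  useSystemProxy: true,
  enableContentSplitting: true,
  contentSizeLimit: 50000,
  startCursor: 0,
  extractContent: false,
  includeMetadata: false,
  debug: false,
};

// Hypothetical helper: fills in defaults and enforces the documented
// dependency that includeMetadata requires extractContent.
function withDefaults(params: FetchParams): Required<FetchParams> {
  const merged = { ...DEFAULTS, ...params };
  merged.includeMetadata = merged.extractContent && merged.includeMetadata;
  return merged;
}

const resolved = withDefaults({ url: "https://example.com", includeMetadata: true });
console.log(resolved.timeout, resolved.contentSizeLimit, resolved.includeMetadata);
```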

Content Refinement Feature Detail

Employ this feature to isolate essential article text, discarding peripheral elements like banners and sidebars:

{ "url": "https://example.com/article", "extractContent": true, "includeMetadata": true }

Successful extraction yields metadata including:

  • Title
  • Author/Byline
  • Source Title
  • Summary Excerpt
  • Content Byte Length
  • Readability Confirmation Flag (isReaderable)

Specialized Operations

To command the closure of the active browser session without executing any data fetching task:

{ "url": "about:blank", "closeBrowser": true }

Proxy Precedence

Proxy configuration resolution order:

  1. Value provided in the command line arguments (if applicable).
  2. Value specified in the request's proxy parameter.
  3. Environment variables (if useSystemProxy remains true).
  4. Git configuration settings (if useSystemProxy remains true).

Setting the proxy parameter overrides the useSystemProxy flag, forcing direct parameter use.
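The resolution order can be expressed as a short function. This is a sketch under assumptions: `ProxySources` and `resolveProxy` are hypothetical names, and how the service actually reads each source (which environment variables, which Git setting) is not specified here, so the sources are passed in as plain values.

```typescript
interface ProxySources {
  cliProxy?: string;       // value from command line arguments, if any
  requestProxy?: string;   // the request's proxy parameter
  envProxy?: string;       // value found in proxy environment variables
  gitProxy?: string;       // value found in Git configuration
  useSystemProxy: boolean; // gates the environment and Git lookups
}

// Hypothetical resolver following the documented precedence.
function resolveProxy(sources: ProxySources): string | undefined {
  if (sources.cliProxy) return sources.cliProxy;
  // An explicit proxy parameter is used regardless of useSystemProxy.
  if (sources.requestProxy) return sources.requestProxy;
  if (sources.useSystemProxy) {
    return sources.envProxy || sources.gitProxy || undefined;
  }
  return undefined; // no proxy
}

console.log(
  resolveProxy({
    requestProxy: "http://127.0.0.1:7890",
    envProxy: "http://10.0.0.1:3128",
    useSystemProxy: false,
  }),
);
```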

Diagnostic Output Details

When debug: true is active, logs directed to stderr are prefixed to identify the generating subsystem:

  • [MCP-SERVER]: Service framework logs.
  • [NODE-FETCH]: Node.js native transport logs.
  • [BROWSER-FETCH]: Headless browser transport logs.
  • [CLIENT]: Consumer application interaction logs.
  • [TOOLS]: Tool logic execution logs.
  • [FETCHER]: High-level retrieval interface logs.
  • [CONTENT]: Data structure and content handling logs.
  • [CONTENT-PROCESSOR]: HTML parsing and text conversion logs.
  • [CONTENT-SIZE]: Data segmentation governance logs.
  • [CHUNK-MANAGER]: Segment ordering and retrieval logs.
  • [ERROR-HANDLER]: Exception reporting system logs.
  • [BROWSER-MANAGER]: Browser lifecycle control logs.
  • [CONTENT-EXTRACTOR]: Intelligent content isolation logs.
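When reading captured stderr output, these prefixes make it easy to filter by subsystem. The following sketch assumes each log line begins with a bracketed uppercase prefix followed by the message, as in the examples in this document; `parseLogLine` is a hypothetical helper and the exact line format is not guaranteed.

```typescript
interface LogLine {
  subsystem: string;
  message: string;
}

// Hypothetical parser for lines shaped like "[PREFIX] message".
// Lines that do not match are reported as null instead of being guessed at.
function parseLogLine(line: string): LogLine | null {
  const match = /^\[([A-Z-]+)\]\s*(.*)$/.exec(line);
  if (match === null) return null;
  return { subsystem: match[1], message: match[2] };
}

const sample = [
  "[MCP-SERVER] MCP server starting...",
  "[CLIENT] Fetching URL: https://example.com",
  "unprefixed output",
];

// Keep only the lines produced by the CLIENT subsystem.
const clientLines = sample
  .map(parseLogLine)
  .filter((entry): entry is LogLine => entry !== null && entry.subsystem === "CLIENT");
console.log(clientLines);
```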

Licensing

Licensed under MIT


Updated by lmcc-dev
