web-content-retriever-mcp
An advanced Model Context Protocol (MCP) service designed to intelligently acquire web resources, supporting diverse output formats like HTML, JSON, and text, while featuring robust, dual-language (English/Chinese) capabilities.
Author: lmcc-dev
web-content-retriever-mcp
This project implements an MCP-compliant server that lets AI assistants retrieve content from external web resources.
Architectural Blueprint
fetch-mcp/
├── src/                              # Source code directory
│   ├── lib/                          # Library files
│   │   ├── fetchers/                 # Resource acquisition modules
│   │   │   ├── browser/              # Headless browser operations
│   │   │   │   ├── BrowserFetcher.ts     # Browser acquisition logic
│   │   │   │   ├── BrowserInstance.ts    # Browser lifecycle oversight
│   │   │   │   └── PageOperations.ts     # DOM interaction routines
│   │   │   ├── node/                 # Standard Node.js acquisition
│   │   │   └── common/               # Shared acquisition utilities
│   │   ├── utils/                    # Auxiliary functional units
│   │   │   ├── ChunkManager.ts       # Data segmentation utility
│   │   │   ├── ContentProcessor.ts   # Transformation of HTML to textual format
│   │   │   ├── ContentExtractor.ts   # Algorithmic core content parsing
│   │   │   ├── ContentSizeManager.ts # Constraints management for payloads
│   │   │   └── ErrorHandler.ts       # Exception management framework
│   │   ├── server/                   # Server operational modules
│   │   │   ├── index.ts              # Server initialization point
│   │   │   ├── browser.ts            # Browser orchestration
│   │   │   ├── fetcher.ts            # Core acquisition orchestration
│   │   │   ├── tools.ts              # Tool registration and command handling
│   │   │   ├── resources.ts          # Managed asset handling
│   │   │   ├── prompts.ts            # Predefined interaction templates
│   │   │   └── types.ts              # Server type definitions contract
│   │   ├── i18n/                     # Multilingual support implementation
│   │   └── types.ts                  # Global type definitions
│   ├── client.ts                     # MCP client wrapper implementation
│   └── mcp-server.ts                 # Primary MCP server executable module
├── index.ts                          # Server execution startup file
├── tests/                            # Quality assurance routines
└── dist/                             # Compiled output assets
MCP Communication Contract
The Model Context Protocol (MCP) defines two transport mechanisms:
- Standard I/O (Stdio): The client launches the MCP server as a child process and communicates with it over standard input (stdin) and standard output (stdout).
- Server-Sent Events (SSE): Used to exchange messages asynchronously between the client and the server.
This project uses the Stdio transport exclusively.
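As a rough illustration of what the Stdio transport carries, MCP messages are JSON-RPC 2.0 objects written one per line. The sketch below shows that newline-delimited framing in isolation. The names `JsonRpcMessage`, `encodeMessage`, and `decodeMessages` are illustrative assumptions, not exports of the MCP SDK or of this project.

```typescript
// Minimal sketch of newline-delimited JSON-RPC framing over stdio.
// All names here are hypothetical and exist only for this example.
interface JsonRpcMessage {
  jsonrpc: "2.0";
  id?: number | string;
  method?: string;
  params?: unknown;
  result?: unknown;
}

// Serialize one message as a single line terminated by "\n".
// JSON.stringify escapes embedded newlines, so the line stays intact.
function encodeMessage(message: JsonRpcMessage): string {
  return JSON.stringify(message) + "\n";
}

// Split a receive buffer into complete messages plus any trailing partial line.
function decodeMessages(buffer: string): { messages: JsonRpcMessage[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // text after the last newline is incomplete
  const messages = lines
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as JsonRpcMessage);
  return { messages, rest };
}
```

Because stdout carries only these framed messages, anything else written to it would break parsing on the client side, which is why diagnostics belong on stderr.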
Key Capabilities
- Adherence to the official MCP SDK specifications.
- Native support for Stdio communication protocol.
- Versatile web acquisition methods (handling HTML, JSON, raw text, Markdown, and sanitized text conversion).
- Adaptive Mode Switching: Seamless, automated transitions between direct HTTP requests and full browser execution contexts.
- Payload segmentation: Automatically fractures excessively large retrieved data into smaller segments to comply with language model context windows.
- Segmented data retrieval: Allows requesting specific data partitions based on context continuity requirements.
- Detailed diagnostic logging piped to standard error (stderr).
- Complete Bilingual Support (English and Simplified Chinese).
- Highly modular architecture for enhanced maintainability and extensibility.
- Algorithmic Content Distillation: Leverages the Mozilla Readability engine to isolate primary textual content, effectively eliminating extraneous elements like advertisements and navigational boilerplate.
- Metadata Harvesting: Capability to pull critical page data, including document title, authorship, publication timelines, and source domain context.
- Semantic Content Verification: Automated classification to discern pages containing substantive information versus transient pages (e.g., login forms, error messages).
- Enhanced Browser Interaction: Provides functionalities for programmatic scrolling, cookie state management, and waiting for specific DOM elements to stabilize before capture.
Deployment Instructions
Smithery Integration
For automated deployment and integration with Claude Desktop via Smithery:
npx -y @smithery/cli install @lmcc-dev/mult-fetch-mcp-server --client claude
Local Setup
pnpm install
Global Availability
pnpm add -g @lmcc-dev/mult-fetch-mcp-server
Alternatively, execute directly without persistent installation via npx:
npx @lmcc-dev/mult-fetch-mcp-server
Integrating with Claude AI
To enable this utility within the Claude desktop environment, configuration must be added to the appropriate server manifest file:
Configuration File Location
- macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
- Windows: %APPDATA%/Claude/claude_desktop_config.json
Configuration Examples
Option 1: Via npx (Recommended Path)
This is the most straightforward method, suitable for globally available executables or direct npx invocation:
{
  "mcpServers": {
    "web-content-retriever-mcp": {
      "command": "npx",
      "args": ["@lmcc-dev/mult-fetch-mcp-server"],
      "env": {
        "MCP_LANG": "en"
      }
    }
  }
}
Set MCP_LANG to "zh" or "en" to choose the locale. JSON does not allow comments, so keep this file comment-free.
Option 2: Explicit Binary Path
Use this when a precise installation location must be referenced:
{
  "mcpServers": {
    "web-content-retriever-mcp": {
      "command": "path-to/bin/node",
      "args": ["path-to/@lmcc-dev/mult-fetch-mcp-server/dist/index.js"],
      "env": {
        "MCP_LANG": "en"
      }
    }
  }
}
As with Option 1, MCP_LANG accepts "zh" or "en".
Note: Substitute the placeholder paths with your system's actual Node.js executable and project directory locations.
Once configured, restart the Claude application. The following tools become callable within your conversational context:
- fetch_html: Retrieves the raw HTML structure.
- fetch_json: Acquires structured JSON data.
- fetch_txt: Fetches unformatted textual output.
- fetch_markdown: Retrieves content formatted using Markdown syntax.
- fetch_plaintext: Delivers HTML content stripped entirely of markup tags.
Compilation Process
pnpm run build
Executing the Server
pnpm run server
Alternatively, using the compiled output:
node dist/index.js
If installed globally:
@lmcc-dev/mult-fetch-mcp-server
Or via npx:
npx @lmcc-dev/mult-fetch-mcp-server
Client Demonstration Utilities
Crucial Caveat: The client.js script serves purely for validation and testing. In live deployment with Claude or similar systems, the AI orchestrator autonomously manages the execution of the segmenting functions.
CLI Testing Interface
Use this interface for development checks:
pnpm run client
Example invocation:
pnpm run client fetch_html '{"url": "https://example.com", "debug": true}'
Testing Segment Control Parameters
These arguments assist in simulating large file handling via the CLI:
- --all-chunks: Flag to mandate sequential retrieval of every generated data segment.
- --max-chunks: Sets an upper bound on the number of segments to process (default limit is 10).
Real-time Payload Streaming Demonstration
This demo client showcases immediate output as segments are processed:
node dist/src/client.js fetch_html '{"url":"https://example.com", "startCursor": 0, "contentSizeLimit": 500}' --all-chunks --debug
The client will sequentially pull all resulting data segments and display them instantly, verifying real-time large content assimilation.
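The retrieval loop that --all-chunks and --max-chunks describe can be sketched roughly as follows. This is not the project's client code: the `ChunkResponse` shape, the `nextCursor` field, and the `FetchChunk` callback are assumptions made for illustration, and the server's real response metadata may be laid out differently.

```typescript
// Hypothetical shape of one segmented response (assumed for this sketch).
interface ChunkResponse {
  text: string;
  chunkId: string;
  nextCursor: number | null; // null when no further segments remain
}

// Hypothetical callback that performs one tool call for a given cursor.
type FetchChunk = (startCursor: number, chunkId?: string) => Promise<ChunkResponse>;

// Pull segments one after another until the content ends or maxChunks is reached.
async function collectChunks(fetchChunk: FetchChunk, maxChunks = 10): Promise<string[]> {
  const parts: string[] = [];
  let cursor: number | null = 0;
  let chunkId: string | undefined;
  while (cursor !== null && parts.length < maxChunks) {
    const chunk = await fetchChunk(cursor, chunkId);
    parts.push(chunk.text); // a streaming client would print here instead
    chunkId = chunk.chunkId;
    cursor = chunk.nextCursor;
  }
  return parts;
}
```

The default of 10 mirrors the documented --max-chunks limit. Capping the loop keeps a misbehaving or very large page from producing an unbounded number of requests.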
Running Automated Checks
# Execute core MCP interaction tests
npm run test:mcp

# Execute tests targeting the mini4k.com benchmark set
npm run test:mini4k

# Execute direct client interface validation tests
npm run test:direct
Locale Configuration
This utility supports both Chinese and English interfaces. Language selection is managed via environment variables:
Environment Variable Control
Set the MCP_LANG variable to dictate the operational locale:
# Activate English mode
export MCP_LANG=en
npm run server

# Activate Chinese mode
export MCP_LANG=zh
npm run server
Windows Command Shell:
set MCP_LANG=zh
npm run server
Using environment variables ensures consistent language settings across all server subprocesses.
Default Locale Selection Logic
The system defaults language based on the following hierarchy:
1. Explicit setting of the MCP_LANG environment variable.
2. Inspection of the host OS language setting (if it begins with "zh", Chinese is selected).
3. English is the ultimate fallback position.
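The three-step order above can be sketched as a small pure function. `resolveLocale` is a hypothetical helper, not the project's implementation, and treating an unsupported MCP_LANG value as unset is an assumption of this sketch.

```typescript
type Locale = "en" | "zh";

// Hypothetical helper mirroring the documented order:
// 1. explicit MCP_LANG, 2. OS language starting with "zh", 3. English.
function resolveLocale(mcpLang: string | undefined, osLang: string | undefined): Locale {
  // Assumption: values other than "en" or "zh" are ignored rather than rejected.
  if (mcpLang === "en" || mcpLang === "zh") return mcpLang;
  if (osLang !== undefined && osLang.toLowerCase().startsWith("zh")) return "zh";
  return "en";
}
```

In a Node.js process this might be called as `resolveLocale(process.env.MCP_LANG, process.env.LANG)`. Using LANG as the source of the OS language is itself an assumption, since the README does not say which setting is inspected.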
Diagnostic Output Management
In compliance with MCP standards, no operational logs are output by default to prevent corruption of the JSON-RPC stream. Diagnostics are explicitly enabled via request parameters:
Invoking the debug Parameter
Include "debug": true within any tool argument payload:
{ "url": "https://example.com", "debug": true }
Debug output is streamed to standard error (stderr) prefixed for source identification:
[MCP-SERVER] MCP server starting...
[CLIENT] Fetching URL: https://example.com
Persistent Debug Log File
When debugging is active, a consolidated log is also written to:
~/.mult-fetch-mcp-server/debug.log
This file is accessible via the MCP resources API:
// Read the contents of the debug log
const result = await client.readResource({ uri: "file:///logs/debug" });
console.log(result.contents[0].text);

// Clear the debug log
const clearResult = await client.readResource({ uri: "file:///logs/clear" });
console.log(clearResult.contents[0].text);
Proxy Configuration Strategy
This utility supports proxy configuration through several tiered mechanisms:
1. Direct Request Parameter Injection
Specify the proxy directly within the call arguments:
{ "url": "https://example.com", "proxy": "http://your-proxy-server:port", "debug": true }
2. Environment Variable Monitoring
The service automatically inspects and utilizes standard environment variables for proxy discovery:
export HTTP_PROXY=http://your-proxy-server:port
export HTTPS_PROXY=http://your-proxy-server:port

# Start server
npm run server
3. Host System Proxy Discovery
The tool attempts to query operating system-level proxy configurations:
- Windows: Reads from environment settings via command utility.
- macOS/Linux: Reads from environment settings via command utility.
4. Proxy Resolution Failure Protocol
If proxy integration proves difficult:
- Set debug: true to inspect the proxy detection logs.
- Manually override using the proxy parameter.
- Verify the proxy URL adheres to the required format (http://host:port or https://host:port).
- For operations demanding browser execution, set useBrowser: true.
5. Browser Context Proxy Handling
When browser automation is invoked (useBrowser: true), the proxy resolution sequence is:
- Explicitly provided proxy parameter.
- Detected system-wide proxy settings.
- Defaulting to no proxy if the above fail.
Browser mode is advantageous for navigating sites employing robust anti-scraping countermeasures or those requiring dynamic JavaScript interpretation.
Parameter Inheritance Logic
- debug: Controlled on a per-request basis via the tool call arguments.
- MCP_LANG: Inferred from environment variables; governs the global localization state of the server instance.
Operational Usage
Instantiating a Client Connector
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';
import path from 'path';
import { fileURLToPath } from 'url';

// Determine script directory
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Establish the communication pipe
const transport = new StdioClientTransport({
  command: 'node',
  args: [path.resolve(__dirname, 'dist/index.js')],
  stderr: 'inherit',
  env: {
    ...process.env // Forward all current environmental variables
  }
});

// Initialize the primary client agent
const client = new Client({
  name: "example-client",
  version: "1.0.0"
});

// Link client to the transport layer
await client.connect(transport);

// Execute a tool call
const result = await client.callTool({
  name: 'fetch_html',
  arguments: {
    url: 'https://example.com',
    debug: true // Local diagnostic enablement
  }
});

if (result.isError) {
  console.error('Acquisition Failure:', result.content[0].text);
} else {
  console.log('Acquisition Success!');
  console.log('Payload snippet:', result.content[0].text.substring(0, 500));
}
Available Toolset
- fetch_html: Retrieve the Document Object Model (DOM) content.
- fetch_json: Retrieve data encoded in JSON format.
- fetch_txt: Retrieve raw, unformatted textual data.
- fetch_markdown: Retrieve content conforming to Markdown specifications.
- fetch_plaintext: Retrieve content sanitized of all HTML markup.
Resource Management Capabilities
The server supports the MCP resource listing (resources/list) and reading (resources/read) endpoints. While the framework is present for accessing project documentation and assets, this specific functionality remains undeveloped.
Resource Interaction Example
// Query for accessible managed assets
const resourcesResult = await client.listResources({});
console.log('Managed assets inventory:', resourcesResult);

// Note: This currently returns empty sets as no resources are defined.
Supported Prompt Templates
The system exposes several predefined interaction templates designed to guide complex requests:
- fetch-website: A generic handler for retrieving site data across various output modalities and controlling browser activation.
- extract-content: Optimized for isolating specific components via CSS selectors and enforcing data type contracts.
- debug-fetch: A specialized template for diagnosing acquisition failures and suggesting remedies.
Prompt Template Utilization
- Use prompts/list to survey the available template catalog.
- Use prompts/get to retrieve the source text for a specific template.
// Check available prompt templates
const promptsResult = await client.listPrompts({});
console.log('Available Prompts:', promptsResult);

// Example: Requesting the website fetching prompt with parameters
const fetchPrompt = await client.getPrompt({
  name: "fetch-website",
  arguments: {
    url: "https://example.com",
    format: "html",
    useBrowser: "false"
  }
});
console.log('Fetch Website Template:', fetchPrompt);

// Example: Generating a diagnostic prompt
const debugPrompt = await client.getPrompt({
  name: "debug-fetch",
  arguments: {
    url: "https://example.com",
    error: "Connection timeout"
  }
});
console.log('Diagnostic Template:', debugPrompt);
Comprehensive Parameter Definitions
Every tool accepts the following arguments for fine-grained control:
Fundamental Settings
- url: The Uniform Resource Locator to target (Mandatory).
- headers: Custom HTTP request metadata (Optional, default: {}).
- proxy: A string defining the proxy endpoint (http://host:port or https://host:port) (Optional).
Network Flow Controls
- timeout: Maximum allowed duration for the operation, in milliseconds (Optional, default: 30000).
- maxRedirects: Limit for URL redirection traversal (Optional, default: 10).
- noDelay: Boolean flag to bypass randomized request spacing (Optional, default: false).
- useSystemProxy: Boolean flag to engage system-level proxy lookups (Optional, default: true).
Payload Sizing & Segmentation
- enableContentSplitting: Boolean to activate data segmentation for large payloads (Optional, default: true).
- contentSizeLimit: Byte threshold triggering data fragmentation (Optional, default: 50000).
- startCursor: Byte offset indicating where to resume data retrieval in segmented content (Optional, default: 0).
These parameters facilitate the management of substantial web data volumes that might exceed the operational buffer of the target AI architecture, enabling retrieval in manageable, contextually continuous blocks.
Segment Addressing
chunkId: A unique identifier assigned to a segmented data set, used subsequently to request the next partition.
When a resource is partitioned, the response includes metadata pointing to the identifier and starting position necessary for the AI to sequence requests for the remaining segments, ensuring complete data integrity via byte-level addressing.
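To make the cursor-based addressing concrete, here is a minimal sketch of how one segment might be cut from a larger payload. `readChunk` and the `Chunk` fields are hypothetical names, not the project's ChunkManager API. As a simplification, this sketch counts string positions (UTF-16 code units) rather than bytes, so it sidesteps the question of splitting multi-byte characters.

```typescript
// Hypothetical description of one segment plus the position of the next one.
interface Chunk {
  text: string;
  startCursor: number;
  nextCursor: number | null; // null when this is the last segment
  totalLength: number;
}

// Return the segment beginning at startCursor, at most contentSizeLimit long.
function readChunk(content: string, startCursor = 0, contentSizeLimit = 50000): Chunk {
  if (contentSizeLimit <= 0) throw new RangeError("contentSizeLimit must be positive");
  // Clamp the cursor into the valid range so out-of-bounds requests yield an empty segment.
  const start = Math.min(Math.max(startCursor, 0), content.length);
  const end = Math.min(start + contentSizeLimit, content.length);
  return {
    text: content.slice(start, end),
    startCursor: start,
    nextCursor: end < content.length ? end : null,
    totalLength: content.length,
  };
}
```

A caller would pass each returned `nextCursor` back as the next `startCursor` until it becomes null. Concatenating the segments in order then reproduces the original content without gaps or overlap.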
Mode Selection Parameters
- useBrowser: Boolean to force the execution environment to be a headless browser (Optional, default: false).
- useNodeFetch: Boolean to strictly enforce the standard Node.js acquisition method (Optional, default: false; overrides useBrowser).
- autoDetectMode: Boolean to enable heuristic switching to browser mode upon standard request failure (e.g., HTTP 403) (Optional, default: true). Setting this to false enforces the explicitly chosen acquisition method.
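How the three mode flags interact can be sketched as two small decisions: which mode to start in, and whether a failed request should be retried in the browser. `initialMode` and `shouldRetryInBrowser` are hypothetical helpers. This sketch assumes that useNodeFetch also disables the automatic fallback, and it models HTTP 403 as the only trigger because that is the only one the documentation names.

```typescript
type FetchMode = "node" | "browser";

interface ModeOptions {
  useBrowser?: boolean;
  useNodeFetch?: boolean;
  autoDetectMode?: boolean;
}

// Pick the starting mode; useNodeFetch takes precedence over useBrowser.
function initialMode(options: ModeOptions): FetchMode {
  if (options.useNodeFetch) return "node";
  return options.useBrowser ? "browser" : "node";
}

// Decide whether a failed standard request should be retried in browser mode.
function shouldRetryInBrowser(options: ModeOptions, mode: FetchMode, httpStatus: number): boolean {
  const autoDetect = options.autoDetectMode ?? true; // documented default
  // Assumption: forcing Node.js fetching also rules out the browser fallback.
  if (!autoDetect || options.useNodeFetch) return false;
  return mode === "node" && httpStatus === 403;
}
```

Keeping the two decisions separate makes the precedence easy to audit: the first function reflects explicit choices only, and the second covers the heuristic switch.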
Browser Automation Specifics
- waitForSelector: A CSS selector the browser must resolve before capture (Optional, default: 'body').
- waitForTimeout: Maximum wait time in milliseconds for the selector resolution (Optional, default: 5000).
- scrollToBottom: Boolean flag to execute a full page scroll operation in browser mode (Optional, default: false).
- saveCookies: Boolean to persist session cookies across browser interactions (Optional, default: true).
- closeBrowser: Boolean to terminate the browser process immediately after the call, regardless of outcome (Optional, default: false).
Content Parsing Controls
- extractContent: Boolean to invoke the Readability algorithm for core content extraction (Optional, default false).
- includeMetadata: Boolean to include structured metadata alongside extracted text (Optional, default false; dependent on extractContent).
- fallbackToOriginal: Boolean to revert to the raw content if the extraction heuristic fails (Optional, default true; dependent on extractContent).
Diagnostic Flag
debug: Boolean to activate verbose diagnostic reporting (Optional, default false).
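The parameters and defaults listed in this section can be summarized as a single TypeScript shape. The `FetchParams` interface, the `DEFAULTS` constant, and `withDefaults` are illustrative names only. The field names and default values are taken from the lists above, but the project's own type definitions may be organized differently.

```typescript
// Illustrative summary of the documented tool arguments.
interface FetchParams {
  url: string;
  headers?: Record<string, string>;
  proxy?: string;
  timeout?: number;
  maxRedirects?: number;
  noDelay?: boolean;
  useSystemProxy?: boolean;
  enableContentSplitting?: boolean;
  contentSizeLimit?: number;
  startCursor?: number;
  chunkId?: string;
  useBrowser?: boolean;
  useNodeFetch?: boolean;
  autoDetectMode?: boolean;
  waitForSelector?: string;
  waitForTimeout?: number;
  scrollToBottom?: boolean;
  saveCookies?: boolean;
  closeBrowser?: boolean;
  extractContent?: boolean;
  includeMetadata?: boolean;
  fallbackToOriginal?: boolean;
  debug?: boolean;
}

// Defaults as documented; proxy and chunkId have no default.
const DEFAULTS = {
  headers: {} as Record<string, string>,
  timeout: 30000,
  maxRedirects: 10,
  noDelay: false,
  useSystemProxy: true,
  enableContentSplitting: true,
  contentSizeLimit: 50000,
  startCursor: 0,
  useBrowser: false,
  useNodeFetch: false,
  autoDetectMode: true,
  waitForSelector: "body",
  waitForTimeout: 5000,
  scrollToBottom: false,
  saveCookies: true,
  closeBrowser: false,
  extractContent: false,
  includeMetadata: false,
  fallbackToOriginal: true,
  debug: false,
};

// Merge caller-supplied arguments over the documented defaults.
function withDefaults(params: FetchParams) {
  return { ...DEFAULTS, ...params };
}
```

For example, `withDefaults({ url: "https://example.com", timeout: 1000 })` keeps every default except the timeout. Note that an argument explicitly set to `undefined` would overwrite its default in this simple merge.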
Distillation Feature Utilization
Employ the content distillation feature to obtain the substantive core of a web page, discarding peripheral noise like navigation, advertisements, and ancillary sidebars:
{ "url": "https://example.com/article", "extractContent": true, "includeMetadata": true }
The successful result will incorporate the following discovered metadata, if present:
- Document Title
- Author Byline
- Source Website Name
- Summary Excerpt
- Content Byte Count
- Readability Status (isReaderable flag)
Edge Case Operations
To isolate and extract content from a structurally complicated page where extraction might otherwise fail, activate the safeguard:
{ "url": "https://example.com/complex-layout", "extractContent": true, "fallbackToOriginal": true }
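The interplay between extractContent and fallbackToOriginal can be sketched as follows. `applyExtraction` and the `Extractor` callback are hypothetical: the callback stands in for the Readability step and is assumed to return null when no main content is found. Throwing when the fallback is disabled is also an assumption, since the documentation does not describe that case.

```typescript
interface ExtractOptions {
  extractContent?: boolean;
  fallbackToOriginal?: boolean;
}

// Hypothetical extractor signature: returns null when extraction fails.
type Extractor = (html: string) => string | null;

// Apply extraction when requested, reverting to the raw HTML if allowed.
function applyExtraction(html: string, options: ExtractOptions, extract: Extractor): string {
  if (!options.extractContent) return html; // extraction not requested
  const extracted = extract(html);
  if (extracted !== null) return extracted;
  if (options.fallbackToOriginal ?? true) return html; // documented default is true
  // Assumed behavior for this sketch only.
  throw new Error("Content extraction failed and fallbackToOriginal is disabled");
}
```

Passing the extractor in as a parameter keeps the decision logic testable without loading a DOM or the Readability library.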
To instigate a shutdown of the persistent browser session without executing any retrieval task:
{ "url": "about:blank", "closeBrowser": true }
Proxy Precedence Rules
The mechanism for determining proxy settings follows this prioritized sequence:
1. Proxy string explicitly defined in the command invocation arguments.
2. The proxy argument provided within the tool's parameter JSON.
3. Environment variables (only if useSystemProxy is set to true).
4. System configuration retrieved via tools like Git (only if useSystemProxy is true).
Crucially, if the proxy parameter is present in the tool arguments, useSystemProxy is automatically set to false.
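The precedence rules above can be expressed as a short resolution function. `ProxySources` and `resolveProxy` are hypothetical names. The sketch takes already-discovered values as input and does not perform any environment or operating system lookups itself.

```typescript
// Candidate proxy values, one per step of the documented order.
interface ProxySources {
  cliProxy?: string;        // step 1: given in the command invocation
  paramProxy?: string;      // step 2: `proxy` tool argument
  envProxy?: string;        // step 3: e.g. from HTTP_PROXY / HTTPS_PROXY
  systemProxy?: string;     // step 4: system configuration
  useSystemProxy?: boolean; // documented default: true
}

// Return the first applicable proxy, or undefined for a direct connection.
function resolveProxy(src: ProxySources): string | undefined {
  if (src.cliProxy) return src.cliProxy;
  // An explicit proxy argument wins and implies useSystemProxy = false.
  if (src.paramProxy) return src.paramProxy;
  const useSystem = src.useSystemProxy ?? true;
  if (!useSystem) return undefined;
  // Empty strings are treated as "not set".
  return src.envProxy || src.systemProxy || undefined;
}
```

Because an explicit proxy returns early, the system-level sources are never consulted in that case, which matches the rule that supplying proxy switches useSystemProxy off.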
Diagnostic Output Channels
With debug: true enabled, output is routed to stderr and prefixed to identify the originating component:
- [MCP-SERVER]: Server lifecycle and protocol handling messages.
- [NODE-FETCH]: Logs originating from the standard Node.js HTTP client.
- [BROWSER-FETCH]: Logs generated by the headless browser runtime.
- [CLIENT]: Messages related to client-side command handling.
- [TOOLS]: Operational logs from tool execution layers.
- [FETCHER]: Logs from the unifying acquisition interface.
- [CONTENT]: Logs concerning raw data processing and handling.
- [CONTENT-PROCESSOR]: Logs specific to HTML-to-text transformation logic.
- [CONTENT-SIZE]: Logs detailing payload segmentation and size constraints enforcement.
- [CHUNK-MANAGER]: Logs detailing the logic of data partitioning and addressing.
- [ERROR-HANDLER]: Logs from the centralized exception management system.
- [BROWSER-MANAGER]: Logs governing the lifecycle and state of browser instances.
- [CONTENT-EXTRACTOR]: Logs related to the Readability algorithm and content filtering routines.
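A prefixed, opt-in logger of the kind described above might look like the sketch below. `createDebugLogger` and `DebugSink` are hypothetical names, not the project's logging API. The sink is injected so the example stays independent of any particular output stream.

```typescript
// Receives one fully formatted log line.
type DebugSink = (line: string) => void;

// Build a logger that emits "[PREFIX] message" lines only when debug is on.
function createDebugLogger(prefix: string, debug: boolean, sink: DebugSink) {
  return (message: string): void => {
    if (!debug) return; // silent by default so the JSON-RPC stream stays clean
    sink(`[${prefix}] ${message}`);
  };
}
```

In a Node.js server the sink would typically be something like `(line) => process.stderr.write(line + "\n")`, so that diagnostics never reach stdout.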
Licensing
MIT

