retrieve-web-data
Fetches web content from a specified URL in one of several formats: JSON, HTML, plain text, or Markdown. It can return either the raw markup or a parsed data structure.
Author: zcaceres
Web Content Retrieval Utility Server
This MCP server retrieves content from the web and returns it as HTML, JSON, plain text, or Markdown.
Functional Units (Tools)
fetch_html
- Retrieves a remote webpage and outputs its raw HyperText Markup Language structure.
- Parameters:
  - url (String, Mandatory): The complete web address to target.
  - headers (Object, Optional): Supplementary HTTP request headers for customization.
  - max_length (Number, Optional): Cap on the data size retrieved (defaults to 5000, adjustable via environment setting).
  - start_index (Number, Optional): Offset for paginated retrieval, used with max_length (default is zero).
- Output: The verbatim HTML payload received from the source.
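The interaction between max_length and start_index can be illustrated with a small sketch. This is not taken from the server's source: sliceContent and readAllPages are hypothetical helpers, and the sketch assumes both parameters count characters of the fetched body.

```typescript
// Hypothetical helper, not taken from the server's source code.
// Assumption: start_index and max_length count characters of the fetched body.
function sliceContent(body: string, startIndex: number = 0, maxLength: number = 5000): string {
  return body.slice(startIndex, startIndex + maxLength);
}

// Collect successive pages by advancing start_index by max_length each time.
function readAllPages(body: string, maxLength: number): string[] {
  if (maxLength <= 0) {
    throw new RangeError("maxLength must be positive");
  }
  const collected: string[] = [];
  for (let start = 0; start < body.length; start += maxLength) {
    collected.push(sliceContent(body, start, maxLength));
  }
  return collected;
}

const sampleHtml = "<html><body><p>Hello, pagination!</p></body></html>";
const pages = readAllPages(sampleHtml, 16);
// Joining the pages in order should reproduce the original document.
console.log(pages.length, pages.join("") === sampleHtml);
```

A client that receives a truncated result would repeat the call with start_index increased by max_length until the returned chunk is shorter than the limit.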
fetch_json
- Pulls a JSON resource from a provided Uniform Resource Locator.
- Parameters:
  - url (String, Mandatory): The URL pointing to the JSON document.
  - headers (Object, Optional): Customization options for the HTTP request headers.
  - max_length (Number, Optional): Maximum byte count to process (default 5000).
  - start_index (Number, Optional): Starting byte position for segmented fetching (default 0).
- Output: The deserialized, usable JSON object structure.
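As a rough illustration of what a JSON fetch involves, the sketch below requests a URL with custom headers and deserializes the body. It is an assumption-laden sketch, not the server's code: fetchJson, FetchLike, ResponseLike, and fakeFetch are hypothetical names, and the fetch function is injected so the example runs without network access.

```typescript
// Minimal structural types so the sketch does not depend on DOM or Node typings.
interface ResponseLike {
  ok: boolean;
  status: number;
  text(): Promise<string>;
}
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> }
) => Promise<ResponseLike>;

// Hypothetical helper, not the server's actual implementation.
async function fetchJson(
  fetchImpl: FetchLike,
  url: string,
  headers: Record<string, string> = {}
): Promise<unknown> {
  const response = await fetchImpl(url, { headers });
  if (!response.ok) {
    throw new Error(`Request failed with HTTP status ${response.status}`);
  }
  // Deserialize the body; JSON.parse throws a SyntaxError on malformed input.
  return JSON.parse(await response.text());
}

// Stand-in for the network so the example runs offline: it echoes the headers.
const fakeFetch: FetchLike = async (_url, init) => ({
  ok: true,
  status: 200,
  text: async () => JSON.stringify({ echoedHeaders: init?.headers ?? {} }),
});

fetchJson(fakeFetch, "https://example.com/data.json", { Accept: "application/json" })
  .then((data) => console.log(data));
```

In an environment that provides the Fetch API, the global fetch function should be usable in place of fakeFetch.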
fetch_txt
- Accesses a web page and returns only its textual content, stripping all markup.
- Parameters:
  - url (String, Mandatory): The target web page address.
  - headers (Object, Optional): Any required custom request headers.
  - max_length (Number, Optional): The upper bound for content capture (standard limit is 5000).
  - start_index (Number, Optional): Parameter for sequential data acquisition (starts at 0).
- Output: The clean, plain text representation of the document, excluding presentation tags, scripts, and stylesheets.
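The kind of transformation fetch_txt performs can be approximated as follows. This README states that the server uses JSDOM for text isolation; the sketch swaps in a naive regular-expression approach purely for illustration, so htmlToText is a hypothetical helper that will mishandle unusual markup and does not decode HTML entities.

```typescript
// Naive stand-in for DOM-based extraction (the server is described as using JSDOM).
function htmlToText(html: string): string {
  return html
    // Drop script and style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Replace any remaining tags with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();
}

const samplePage =
  "<html><head><style>p { color: red; }</style></head>" +
  "<body><p>Hello <b>world</b></p><script>alert(1)</script></body></html>";
console.log(htmlToText(samplePage)); // prints "Hello world"
```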
fetch_markdown
- Fetches a web page and transforms its content into Markdown formatting.
- Parameters:
  - url (String, Mandatory): The URL of the document to be fetched.
  - headers (Object, Optional): Custom headers to attach to the outbound request.
  - max_length (Number, Optional): Limit on the amount of data to retrieve (default 5000).
  - start_index (Number, Optional): Byte offset for fetching partial content (defaults to the beginning).
- Output: The document's content rendered in Markdown syntax.
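All four tools accept the same arguments, so a client-side call can be sketched once. MCP clients invoke tools with a JSON-RPC 2.0 message whose method is tools/call; the helper buildToolCall and the two type names below are hypothetical, and most client libraries construct this message for you.

```typescript
// Argument shape shared by the four tools, as listed in this README.
interface FetchToolArguments {
  url: string;
  headers?: Record<string, string>;
  max_length?: number;
  start_index?: number;
}

type FetchToolName = "fetch_html" | "fetch_json" | "fetch_txt" | "fetch_markdown";

// Build the JSON-RPC 2.0 message an MCP client sends to invoke a tool.
function buildToolCall(id: number, tool: FetchToolName, args: FetchToolArguments) {
  return {
    jsonrpc: "2.0" as const,
    id,
    method: "tools/call" as const,
    params: { name: tool, arguments: args },
  };
}

const toolRequest = buildToolCall(1, "fetch_markdown", {
  url: "https://example.com/article",
  max_length: 2000,
  start_index: 2000, // ask for the second 2000-unit page
});
console.log(JSON.stringify(toolRequest, null, 2));
```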
Assets (Resources)
This server maintains no persistent storage; operations are stateless, focusing purely on immediate content acquisition and transformation.
Implementation Guide
- Obtain the source repository.
- Install necessary node modules:
  npm install
- Compile the source code:
  npm run build
Execution
To start the server, which typically communicates with its client over standard input/output:
npm start
This launches the Fetch MCP Server interface.
Configuration Variables (Environment)
- DEFAULT_LIMIT: Sets the default size limit for fetches (setting it to 0 removes the limit).
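One plausible reading of how DEFAULT_LIMIT combines with a per-call max_length is sketched below. The helper resolveLimit is hypothetical, and two details are assumptions rather than documented behavior: that an explicit max_length takes precedence over the environment value, and that an unparsable value falls back to 5000.

```typescript
// Hypothetical helper showing one plausible reading of DEFAULT_LIMIT.
// Returns the size limit to apply, or null when no limit should be enforced.
function resolveLimit(
  env: Record<string, string | undefined>,
  requested?: number
): number | null {
  const fallback = 5000; // default stated in the tool descriptions
  const raw = env.DEFAULT_LIMIT;
  const parsed = raw !== undefined ? parseInt(raw, 10) : fallback;
  const defaultLimit = isNaN(parsed) ? fallback : parsed; // assumption: bad input falls back
  // Assumption: an explicit max_length overrides the environment default.
  const limit = requested !== undefined ? requested : defaultLimit;
  return limit === 0 ? null : limit; // 0 means "no limit"
}

console.log(resolveLimit({}));                              // 5000
console.log(resolveLimit({ DEFAULT_LIMIT: "50000" }));      // 50000
console.log(resolveLimit({ DEFAULT_LIMIT: "0" }));          // null
console.log(resolveLimit({ DEFAULT_LIMIT: "50000" }, 100)); // 100
```

In a Node.js process, the first argument would typically be process.env.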
Integration with Client Applications
To connect this server to a local desktop client application, add it to the client's MCP server configuration as follows. The DEFAULT_LIMIT value of 50000 is only an example of setting a higher default size cap.
{
  "mcpServers": {
    "fetch": {
      "command": "npx",
      "args": ["mcp-fetch-server"],
      "env": {
        "DEFAULT_LIMIT": "50000"
      }
    }
  }
}
Key Capabilities
- Utilizes contemporary web fetching mechanisms (Fetch API).
- Permits user-defined request headers.
- Offers data retrieval across four distinct output formats: HTML, JSON, raw text, and Markdown.
- Employs JSDOM for accurate parsing of HTML structure and text isolation.
- Leverages TurndownService for reliable conversion from HTML to Markdown.
Development Lifecycle
- Execute npm run dev to start the TypeScript compiler in watch mode.
- Run npm test to execute the automated test suite.
Licensing
This software is distributed under the terms of the MIT License.
WIKIPEDIA: XMLHttpRequest (XHR) is an API in the form of a JavaScript object whose methods send HTTP requests from a web browser to a server. These methods let a browser-based application communicate with the server asynchronously after the page has loaded and receive data back. XMLHttpRequest is a fundamental component of Ajax programming. Before its widespread adoption, server interaction happened mainly through hyperlink navigation and form submissions, both of which typically caused a full-page refresh.

== Origin Story ==

The concept behind XMLHttpRequest originated with the developers of Microsoft Outlook Web Access for Exchange Server 2000, and it first shipped in the Internet Explorer 5 browser in 1999. The initial implementation did not use the XMLHttpRequest identifier; instead, developers used constructor calls such as ActiveXObject("Msxml2.XMLHTTP") and ActiveXObject("Microsoft.XMLHTTP"). As of Internet Explorer 7 (2006), all major browsers support the XMLHttpRequest identifier. It is now the established baseline across all major browser engines, including Mozilla's Gecko (since 2002), Safari 1.2 (2004), and Opera 8.0 (2005).

=== Standardization ===

The World Wide Web Consortium (W3C) published an initial Working Draft specification for the XMLHttpRequest object on April 5, 2006. On February 25, 2008, the W3C released a Level 2 specification that added progress monitoring methods, cross-origin requests, and byte stream handling. Toward the end of 2011, the Level 2 features were merged back into the primary specification. In late 2012, maintenance passed to the WHATWG, which maintains the document as a living standard using Web IDL definitions.

== Operational Sequence ==

Interacting with a server using XMLHttpRequest generally involves the following sequence of programming steps:
- Create an instance of the XMLHttpRequest object via its constructor.
- Call the open method to define the request type (e.g., GET, POST), specify the target resource URI, and select synchronous or asynchronous execution.
- For an asynchronous request, register an event handler that will be notified of state transitions.
- Start the request by calling the send method.
- React to state changes in the event listener. Data received from the server is typically stored in the responseText attribute. When the object has finished processing the complete response, its state changes to 4, the "done" state.

Beyond these fundamental steps, XMLHttpRequest offers extensive control over how the request is sent and how the response is processed. Custom header fields can be added to influence how the server handles the request, and data can be uploaded in the send call. The response can be parsed automatically from JSON into a readily usable JavaScript object, or it can be processed incrementally as segments arrive rather than waiting for the full payload. Requests can also be aborted prematurely or configured to time out if they do not complete within a set period.

== Cross-Origin Communication ==

Early in the World Wide Web's evolution, browsers came to restrict fetching resources from domains other than the originating site, a limitation that threatened to break many planned applications.
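The operational sequence described for XMLHttpRequest (construct, open, register a handler, send, react to state 4) can be sketched as below. The names getText and XhrLike are hypothetical; XhrLike is a minimal structural description of the XMLHttpRequest members the sketch touches, declared locally so that the code does not depend on browser typings. In a browser, the factory argument would wrap new XMLHttpRequest(), possibly with a type cast.

```typescript
// Minimal structural view of the XMLHttpRequest members used below.
interface XhrLike {
  readyState: number;
  status: number;
  responseText: string;
  timeout: number;
  onreadystatechange: ((event?: any) => any) | null;
  ontimeout: ((event?: any) => any) | null;
  open(method: string, url: string, isAsync: boolean): void;
  setRequestHeader(headerName: string, value: string): void;
  send(body?: string | null): void;
}

// Hypothetical helper: performs an asynchronous GET and reports the result once.
function getText(
  createXhr: () => XhrLike,
  url: string,
  onDone: (error: Error | null, text?: string) => void
): void {
  let finished = false; // guards against reporting both a timeout and a final state
  const finish = (error: Error | null, text?: string) => {
    if (finished) return;
    finished = true;
    onDone(error, text);
  };

  const xhr = createXhr();                      // step 1: construct the object
  xhr.open("GET", url, true);                   // step 2: method, URL, asynchronous mode
  xhr.setRequestHeader("Accept", "text/plain"); // optional custom header
  xhr.timeout = 5000;                           // give up after 5 seconds
  xhr.ontimeout = () => finish(new Error("Request timed out"));
  xhr.onreadystatechange = () => {              // step 3: react to state changes
    if (xhr.readyState !== 4) return;           // 4 is the "done" state
    if (xhr.status >= 200 && xhr.status < 300) {
      finish(null, xhr.responseText);
    } else {
      finish(new Error("HTTP status " + xhr.status));
    }
  };
  xhr.send();                                   // step 4: transmit the request
}
```

The finished flag is a design choice of this sketch: a timed-out request can trigger both the timeout handler and a final state change, and the callback should fire only once.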
