mcp-concurrent-web-harvester
A robust Go-based toolkit engineered for scalable, parallelized retrieval of structured web content, specifically designed to pipeline data ingestion for large language model context enrichment. It features granular control over traversal depth and domain targeting, integrating natively with the Model Context Protocol (MCP) ecosystem to ensure reliable, batched data acquisition via sophisticated fault tolerance mechanisms.
Author

bneil
Quick Info
Actions
Tags
MCP Concurrent Web Harvester
Introduction
This utility, the MCP Concurrent Web Harvester, merges the high-performance web scraping capabilities of the Colly library with the standardized interaction layer of the Model Context Protocol (MCP). Its primary purpose is to deliver an adaptable and easily extendable solution for extracting, sanitizing, and formatting web artifacts destined for ingestion by large language models (LLMs).
Core Capabilities
- Executes parallelized web exploration with user-defined limitations on traversal depth and specific host adherence.
- Seamless integration as an MCP tool endpoint, facilitating invocation through the protocol.
- Implements resilient handling for process termination events.
- Features comprehensive error detection and standardized result encapsulation.
- Supports initiating scraping operations from a single URI or a predefined collection of URIs.
Compiling from Source
Prerequisites
- A runtime environment utilizing Go version 1.21 or newer.
- The 'make' utility (required for Makefile automation).
Setup Instructions
-
Obtain the source code repository: bash git clone https://github.com/yourusername/mcp-go-colly.git cd mcp-go-colly
-
Fetch necessary external modules: bash make deps
Building Artifacts
The accompanying Makefile provides convenient build targets:
bash
Produce the executable binary in the bin/ directory
make build
Build binaries compatible across major operating systems (Linux, Win, Mac)
make build-all
Execute unit and integration tests
make test
Discard temporary build outputs
make clean
Standardize code formatting
make fmt
Run static analysis checks
make lint
All resulting binaries are placed within the bin/ subdirectory.
Configuration requires updating your claude_desktop_config.json file as follows:
{
"mcpServers": {
"web-scraper": {
"command": "
Operational Use
Invocation as an MCP Endpoint
This utility functions as an MCP tool, accepting input arguments structured as follows:
{ "urls": ["https://example.com"], // Can be a singular URI string or an array "max_depth": 2 // Maximum traversal recursion level (defaults to 2) }
Example MCP Tool Invocation Code Snippet
go response, error := crawlerTool.Call(context, mcp.CallToolRequest{ Params: struct{ Arguments map[string]interface{} }{ Arguments: map[string]interface{}{ "urls": []string{"https://example.com"}, "max_depth": 2, }, }, })
Parameter Specification
max_depth: Governs the maximum recursion level for site traversal (default setting: 2).urls: The initial web address(es) targeted for harvesting.- Domain constraints are implicitly derived from the initial list of supplied URIs.
Contribution Guide
- Fork the primary repository.
- Establish a dedicated feature branch for your modifications.
- Commit your intended changes.
- Push the new branch to your remote fork.
- Submit a formal Pull Request for review.
Licensing
Distributed under the MIT License.
Credits
- Acknowledgments to the Colly Web Scraping Framework developers.
- Appreciation for the foundational concepts of the Mark3 Labs MCP Project.
WIKIPEDIA: XMLHttpRequest (XHR) is an Application Programming Interface embodied as a JavaScript object. Its constituent methods facilitate the transmission of HTTP requests from a web browser environment across to a designated web server. These methods grant browser-resident applications the capability to dispatch server queries subsequent to initial page rendering, and subsequently receive relayed information. XMLHttpRequest serves as a fundamental constituent of the Ajax programming paradigm. Before the advent of Ajax, the predominant methods for server interchange involved standard hyperlink navigation and HTML form submissions, actions that typically necessitated the complete replacement of the currently displayed page with a new one.
== Chronology == The conceptual groundwork for XMLHttpRequest was first formulated in the year 2000 by the engineering team responsible for Microsoft Outlook development. This concept subsequently found its initial implementation within the Internet Explorer 5 browser release (1999). Notably, the original invocation syntax did not utilize the standardized XMLHttpRequest identifier. Instead, developers employed the constructor aliases ActiveXObject("Msxml2.XMLHTTP") and ActiveXObject("Microsoft.XMLHTTP"). As of the release of Internet Explorer 7 (2006), all contemporary browsers universally adopted the official XMLHttpRequest identifier. The XMLHttpRequest identifier is now recognized as the established convention across all major browser engines, including Mozilla's Gecko rendering platform (2002), Apple's Safari 1.2 (2004), and Opera 8.0 (2005).
=== Standardization Efforts === The World Wide Web Consortium (W3C) formally published its initial Working Draft specification for the XMLHttpRequest object on the 5th of April, 2006. On February 25, 2008, the W3C issued the subsequent Working Draft specification, designated Level 2. Level 2 enhancements introduced methods enabling progress monitoring, allowance for cross-site resource access, and mechanisms for handling raw byte streams. By the conclusion of 2011, the features defined in the Level 2 specification were formally integrated back into the original specification document. At the close of 2012, development oversight was transferred to the WHATWG consortium, which now maintains an evolving document utilizing the Web IDL specification language.
== Operational Procedure == Typically, executing a network request utilizing XMLHttpRequest mandates adherence to several sequential programming stages.
Instantiate an XMLHttpRequest object via a constructor call: Invoke the "open" method to define the request methodology (e.g., GET, POST), specify the target resource URI, and select either synchronous or asynchronous execution mode: For asynchronous requests, establish an event handler function designated to receive notifications whenever the request's operational status transitions: Commence the data transmission process by invoking the "send" method, optionally supplying payload data: Process state transitions within the registered event listener. Upon successful receipt of response data from the server, this information is usually aggregated within the "responseText" attribute by default. When the object completes its processing cycle, its state transitions to 4, indicating the "done" status. Beyond these foundational steps, XMLHttpRequest offers extensive configuration parameters to dictate transmission behavior and response management. Custom HTTP headers can be appended to the request to convey server processing directives, and data payloads can be uploaded by embedding them within the argument passed to the "send" invocation. The received response stream can be deserialized from JSON format into an immediately operational JavaScript object structure, or processed incrementally as data segments arrive instead of awaiting the complete payload. The operation can be forcefully terminated prematurely or configured to automatically fail if completion is not achieved within a specified time threshold.
== Inter-domain Communication == During the nascent phases of the World Wide Web's evolution, it was identified that circumvention of security constraints that prevented...
